Re: storing indexes on ssd

2018-02-13 Thread Dan Kinder
On a single node that's a bit less than half full, the index files are 87G.

How will the OS disk cache know to keep the index file blocks cached but not
cache blocks from the data files? As far as I know it is not smart enough
to handle that gracefully.
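
(If it helps, you can at least check what the page cache is actually holding
with something like vmtouch; the paths below assume a default data directory
layout and are only illustrative:)

# show how much of each index file is currently resident in the page cache
vmtouch -v /var/lib/cassandra/data/myks/mytable-*/*-Index.db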

Re: RAM expensiveness, see
https://www.extremetech.com/computing/263031-ram-prices-roof-stuck-way --
it's really not an important point though; RAM is still far more expensive
than disk, regardless of whether the price has been going up.

On Tue, Feb 13, 2018 at 12:02 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Feb 13, 2018 at 1:30 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Created https://issues.apache.org/jira/browse/CASSANDRA-14229
>>
>
> This is confusing.  You've already started the conversation here...
>
> How big are your index files in the end?  Even if Cassandra doesn't cache
> them in or (off-) heap, they might as well just fit into the OS disk cache.
>
> From your ticket description:
> > ... as ram continues to get more expensive,..
>
> Where did you get that from?  I would expect quite the opposite.
>
> Regards,
> --
> Alex
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: storing indexes on ssd

2018-02-12 Thread Dan Kinder
Created https://issues.apache.org/jira/browse/CASSANDRA-14229

On Mon, Feb 12, 2018 at 12:10 AM, Mateusz Korniak <
mateusz-li...@ant.gliwice.pl> wrote:

> On Saturday 10 of February 2018 23:09:40 Dan Kinder wrote:
> > We're optimizing Cassandra right now for fairly random reads on a large
> > dataset. In this dataset, the values are much larger than the keys. I was
> > wondering, is it possible to have Cassandra write the *index* files
> > (*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db)
> to
> > another (HDD)? This would be an overall win for us since it's
> > cost-prohibitive to store the data itself all on SSD, but we hit the
> limits
> > if we just use HDD; effectively we would need to buy double, since we are
> > doing 2 random reads (index + data).
>
> Considered putting cassandra data on lvmcache?
> We are using this on small (3x2TB compressed data, 128/256MB cache)
> clusters
> since reaching I/O limits of 2xHDD in RAID10.
>
>
> --
> Mateusz Korniak
> "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
> krótko mówiąc - podpora społeczeństwa."
> Nikos Kazantzakis - "Grek Zorba"
>
>
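
(For anyone else reading along, my rough understanding of the lvmcache route
is something like the following; the volume group, LV, and device names are
made up, sizes are illustrative, and exact flags vary by lvm2 version:)

# add the SSD to the existing volume group holding the Cassandra data LV
pvcreate /dev/nvme0n1
vgextend vg_data /dev/nvme0n1
# carve a cache pool out of the SSD and attach it to the data LV
lvcreate --type cache-pool -L 200G -n cassandra_cache vg_data /dev/nvme0n1
lvconvert --type cache --cachepool vg_data/cassandra_cache vg_data/cassandra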


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Setting min_index_interval to 1?

2018-02-12 Thread Dan Kinder
@Hannu this was based on the assumption that if we receive a read for a key
that is sampled, it'll be treated as cached and won't go to the index on
disk. Part of my question was whether that's the case; I'm not sure.

Btw I ended up giving up on this; trying the key cache route already showed
that it would require more memory than we have available. And even then,
the performance started to tank: we saw irqbalance and other processes peg
the CPU even without much load, so there was some NUMA-related problem
there that I don't have time to look into.

On Fri, Feb 2, 2018 at 12:42 AM, Hannu Kröger <hkro...@gmail.com> wrote:

> Wouldn’t that still try to read the index on the disk? So you would just
> potentially have all keys on the memory and on the disk and reading would
> first happen in memory and then on the disk and only after that you would
> read the sstable.
>
> So you wouldn’t gain much, right?
>
> Hannu
>
> On 2 Feb 2018, at 02:25, Nate McCall <n...@thelastpickle.com> wrote:
>
>
>> Another was the crazy idea I started with of setting min_index_interval
>> to 1. My guess was that this would cause it to read all index entries, and
>> effectively have them all cached permanently. And it would read them
>> straight out of the SSTables on every restart. Would this work? Other than
>> probably causing a really long startup time, are there issues with this?
>>
>>
> I've never tried that. It sounds like you understand the potential impact
> on memory and startup time. If you have the data in such a way that you can
> easily experiment, I would like to see a breakdown of the impact on
> response time vs. memory usage as well as where the point of diminishing
> returns is on turning this down towards 1 (I think there will be a sweet
> spot somewhere).
>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


storing indexes on ssd

2018-02-10 Thread Dan Kinder
Hi,

We're optimizing Cassandra right now for fairly random reads on a large
dataset. In this dataset, the values are much larger than the keys. I was
wondering, is it possible to have Cassandra write the *index* files
(*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db) to
another (HDD)? This would be an overall win for us since it's
cost-prohibitive to store the data itself all on SSD, but we hit the limits
if we just use HDD; effectively we would need to buy double, since we are
doing 2 random reads (index + data).

Thanks,
-dan


Setting min_index_interval to 1?

2018-02-01 Thread Dan Kinder
Hi, I have an unusual case here: I'm wondering what will happen if I
set min_index_interval to 1.

Here's the logic. Suppose I have a table where I really want to squeeze as
many reads/sec out of it as possible, and where the row data size is much
larger than the keys. E.g. the keys are a few bytes, the row data is ~500KB.

This table would be a great candidate for key caching. Let's suppose I have
enough memory to have every key cached. However, it's a lot of data, and
the reads are very random. So it would take a very long time for that cache
to warm up.

One solution is that I write a little app to go through every key to warm
it up manually, and ensure that Cassandra has key_cache_keys_to_save set to
save the whole thing on restart. (Anyone know of a better way of doing
this?)
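
(A very rough sketch of the warming app I have in mind, using gocql; the
keyspace, table, and column names are made up and the paging/parallelism
would need tuning:)

package main

import (
    "log"

    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("127.0.0.1")
    cluster.Keyspace = "myks"
    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // Page through every partition key in the table...
    iter := session.Query(`SELECT DISTINCT key FROM mytable`).PageSize(5000).Iter()
    var key string
    for iter.Scan(&key) {
        // ...and issue a cheap point read for each one; the point read is
        // what actually pulls the key into the key cache.
        if err := session.Query(`SELECT key FROM mytable WHERE key = ?`, key).Exec(); err != nil {
            log.Println("warm failed for", key, err)
        }
    }
    if err := iter.Close(); err != nil {
        log.Fatal(err)
    }
}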

Another was the crazy idea I started with of setting min_index_interval to
1. My guess was that this would cause it to read all index entries, and
effectively have them all cached permanently. And it would read them
straight out of the SSTables on every restart. Would this work? Other than
probably causing a really long startup time, are there issues with this?
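
(Concretely I'd be setting the table properties to something like the
following, with a placeholder keyspace/table; I'm not sure what other
constraints apply:)

ALTER TABLE myks.mytable
  WITH min_index_interval = 1 AND max_index_interval = 1;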

Thanks,
-dan


LCS major compaction on 3.2+ on JBOD

2017-10-05 Thread Dan Kinder
Hi

I am wondering how major compaction behaves for a table using LCS on JBOD
with Cassandra 3.2+'s JBOD improvements.

Up to then I know that major compaction would use a single thread, include
all SSTables in a single compaction, and spit out a bunch of SSTables in
appropriate levels.

Does 3.2+ do 1 compaction per disk, since they are separate leveled
structures? Or does it do a single compaction task that writes SSTables to
the appropriate disk by key range?

-dan


Re:

2017-10-02 Thread Dan Kinder
Created https://issues.apache.org/jira/browse/CASSANDRA-13923

On Mon, Oct 2, 2017 at 12:06 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Sure will do.
>
> On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> You're right, sorry I didn't read the full stack (gmail hid it from me)
>>
>> Would you open a JIRA with your stack traces, and note (somewhat loudly)
>> that this is a regression?
>>
>>
>> On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>>
>>> Right, I just meant that calling it at all results in holding a read
>>> lock, which unfortunately is blocking these read threads.
>>>
>>> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com>
>>>> wrote:
>>>>
>>>>> (As a side note, it seems silly to call shouldDefragment at all on a
>>>>> read if the compaction strategy is not STCS)
>>>>>
>>>>>
>>>>>
>>>> It defaults to false:
>>>>
>>>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java#L302
>>>>
>>>> And nothing other than STCS overrides it to true.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dan Kinder
>>> Principal Software Engineer
>>> Turnitin – www.turnitin.com
>>> dkin...@turnitin.com
>>>
>>
>>
>
>
> --
> Dan Kinder
> Principal Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-10-02 Thread Dan Kinder
Sure will do.

On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa <jji...@gmail.com> wrote:

> You're right, sorry I didn't read the full stack (gmail hid it from me)
>
> Would you open a JIRA with your stack traces, and note (somewhat loudly)
> that this is a regression?
>
>
> On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> Right, I just meant that calling it at all results in holding a read
>> lock, which unfortunately is blocking these read threads.
>>
>> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>>
>>>
>>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com>
>>> wrote:
>>>
>>>> (As a side note, it seems silly to call shouldDefragment at all on a
>>>> read if the compaction strategy is not STCS)
>>>>
>>>>
>>>>
>>> It defaults to false:
>>>
>>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java#L302
>>>
>>> And nothing other than STCS overrides it to true.
>>>
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Principal Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-10-02 Thread Dan Kinder
Right, I just meant that calling it at all results in holding a read lock,
which unfortunately is blocking these read threads.

On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa <jji...@gmail.com> wrote:

>
>
> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder <dkin...@turnitin.com> wrote:
>
>> (As a side note, it seems silly to call shouldDefragment at all on a read
>> if the compaction strategy is not STCS)
>>
>>
>>
> It defaults to false:
>
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.java#L302
>
> And nothing other than STCS overrides it to true.
>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-09-28 Thread Dan Kinder
Sorry, I take back what I said about that ReadStage exception; I had
accidentally ended up too early in the logs. The node with the growing
ReadStage backlog shows no exceptions in its logs.

nodetool tpstats
Pool Name                         Active   Pending   Completed   Blocked   All time blocked
ReadStage                              8      1882       45881         0                  0
MiscStage                              0         0           0         0                  0
CompactionExecutor                     9         9        2551         0                  0
MutationStage                          0         0    35929880         0                  0
GossipStage                            0         0       35793         0                  0
RequestResponseStage                   0         0      751285         0                  0
ReadRepairStage                        0         0         224         0                  0
CounterMutationStage                   0         0           0         0                  0
MemtableFlushWriter                    0         0         111         0                  0
MemtablePostFlush                      0         0         239         0                  0
ValidationExecutor                     0         0           0         0                  0
ViewMutationStage                      0         0           0         0                  0
CacheCleanupExecutor                   0         0           0         0                  0
PerDiskMemtableFlushWriter_10          0         0         104         0                  0
PerDiskMemtableFlushWriter_11          0         0         104         0                  0
MemtableReclaimMemory                  0         0         116         0                  0
PendingRangeCalculator                 0         0          16         0                  0
SecondaryIndexManagement               0         0           0         0                  0
HintsDispatcher                        0         0          13         0                  0
PerDiskMemtableFlushWriter_1           0         0         104         0                  0
Native-Transport-Requests              0         0     2607030         0                  0
PerDiskMemtableFlushWriter_2           0         0         104         0                  0
MigrationStage                         0         0         278         0                  0
PerDiskMemtableFlushWriter_0           0         0         115         0                  0
Sampler                                0         0           0         0                  0
PerDiskMemtableFlushWriter_5           0         0         104         0                  0
InternalResponseStage                  0         0         298         0                  0
PerDiskMemtableFlushWriter_6           0         0         104         0                  0
PerDiskMemtableFlushWriter_3           0         0         104         0                  0
PerDiskMemtableFlushWriter_4           0         0         104         0                  0
PerDiskMemtableFlushWriter_9           0         0         104         0                  0
AntiEntropyStage                       0         0           0         0                  0
PerDiskMemtableFlushWriter_7           0         0         104         0                  0
PerDiskMemtableFlushWriter_8           0         0         104         0                  0

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE  0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR  0


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242
> - Digest mismatch:
>
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(6150926370328526396, 696a6374652e6f7267) (
> 2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>
> at 
> org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>

Re:

2017-09-28 Thread Dan Kinder
Thanks for the responses.

@Prem yes this is after the entire cluster is on 3.11, but no I did not run
upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
bring it back (thankfully this cluster is not serving live traffic). The
nodes seemed okay for an hour or two, but I see the issue again, without me
bouncing any nodes. This time it's ReadStage that's building up, and the
exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 -
Digest mismatch:

org.apache.cassandra.service.DigestMismatchException: Mismatch for key
DecoratedKey(6150926370328526396, 696a6374652e6f7267)
(2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)

at
org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_71]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_71]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]


Do you think running upgradesstables would help? Or relocatesstables? I
presumed it shouldn't be necessary for Cassandra to function, just an
optimization.

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Dan,
>
>
>
> do you see any major GC? We have been hit by the following memory leak in
> our loadtest environment with 3.11.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-13754
>
>
>
> So, depending on the heap size and uptime, you might get into heap
> troubles.
>
>
>
> Thomas
>
>
>
> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
> *Sent:* Donnerstag, 28. September 2017 18:20
> *To:* user@cassandra.apache.org
> *Subject:*
>
>
>
> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and restore to
> backup...
>
> The only issue I found 

Re:

2017-09-28 Thread Dan Kinder
I should also note that I see nodes become locked up without seeing that
exception. But the GossipStage buildup does seem correlated with gossip
activity, e.g. my restarting a different node.

On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and restore to
> backup...
>
> The only issue I found that looked similar is https://issues.apache.org/
> jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>
>
> $ nodetool tpstats
>
> Pool Name Active   Pending  Completed
> Blocked  All time blocked
>
> ReadStage  0 0 582103 0
> 0
>
> MiscStage  0 0  0 0
> 0
>
> CompactionExecutor1111   2868 0
> 0
>
> MutationStage 32   4593678   55057393 0
> 0
>
> GossipStage1  2818 371487 0
> 0
>
> RequestResponseStage   0 04345522 0
> 0
>
> ReadRepairStage0 0 151473 0
> 0
>
> CounterMutationStage   0 0  0 0
> 0
>
> MemtableFlushWriter181 76 0
> 0
>
> MemtablePostFlush  1   382139 0
> 0
>
> ValidationExecutor 0 0  0 0
> 0
>
> ViewMutationStage  0 0  0 0
> 0
>
> CacheCleanupExecutor   0 0  0 0
> 0
>
> PerDiskMemtableFlushWriter_10  0 0 69 0
> 0
>
> PerDiskMemtableFlushWriter_11  0 0 69 0
> 0
>
> MemtableReclaimMemory  0 0 81 0
> 0
>
> PendingRangeCa

[no subject]

2017-09-28 Thread Dan Kinder
Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and am seeing the
following. The cluster does function for a while, but then some stages
begin to back up, and the node does not recover and does not drain the
tasks, even under no load. This happens both to MutationStage and
GossipStage.

I do see the following exception happen in the logs:


ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
CassandraDaemon.java:228 - Exception in thread
Thread[ReadRepairStage:2328,5,main]

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
received only 1 responses.

at
org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
~[na:1.8.0_91]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
~[na:1.8.0_91]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
~[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]


But it's hard to correlate precisely with things going bad. It is also very
strange to me since I have both read_repair_chance and
dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
confusing why ReadRepairStage would err.

Anyone have thoughts on this? It's pretty muddling, and causes nodes to
lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
If I can't find a resolution I'm going to need to downgrade and restore to
backup...

The only issue I found that looked similar is
https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
be fixed by 3.10.


$ nodetool tpstats

Pool Name Active   Pending  Completed   Blocked
All time blocked

ReadStage  0 0 582103 0
  0

MiscStage  0 0  0 0
  0

CompactionExecutor1111   2868 0
  0

MutationStage 32   4593678   55057393 0
  0

GossipStage1  2818 371487 0
  0

RequestResponseStage   0 04345522 0
  0

ReadRepairStage0 0 151473 0
  0

CounterMutationStage   0 0  0 0
  0

MemtableFlushWriter181 76 0
  0

MemtablePostFlush  1   382139 0
  0

ValidationExecutor 0 0  0 0
  0

ViewMutationStage  0 0  0 0
  0

CacheCleanupExecutor   0 0  0 0
  0

PerDiskMemtableFlushWriter_10  0 0 69 0
  0

PerDiskMemtableFlushWriter_11  0 0 69 0
  0

MemtableReclaimMemory  0 0 81 0
  0

PendingRangeCalculator 0 0 32 0
  0

SecondaryIndexManagement   0 0  0 0
  0

HintsDispatcher0 0596 0
  0

PerDiskMemtableFlushWriter_1   0 0 69 0
  0

Native-Transport-Requests 11 04547746 0
  67

PerDiskMemtableFlushWriter_2   0 0 69 0
  0

MigrationStage 1  1545586 0
  0

PerDiskMemtableFlushWriter_0   0 0 80 0
  0

Sampler0 0  0 0
  0

PerDiskMemtableFlushWriter_5   0 0 69 0
  

Re: Problems with large partitions and compaction

2017-02-15 Thread Dan Kinder
What Cassandra version? CMS or G1? What are your timeouts set to?

"GC activity"  - Even if there isn't a lot of activity per se maybe there
is a single long pause happening. I have seen large partitions cause lots
of allocation fast.

Looking at SSTable Levels in nodetool cfstats can help, look at it for all
your tables.

Don't recommend switching to STCS until you know more. You end up with
massive compaction that takes a long time to settle down.

On Tue, Feb 14, 2017 at 5:50 PM, John Sanda <john.sa...@gmail.com> wrote:

> I have a table that uses LCS and has wound up with partitions upwards of
> 700 MB. I am seeing lots of the large partition warnings. Client requests
> are subsequently failing. The driver is not reporting timeout exception,
> just NoHostAvailableExceptions (in the logs I have reviewed so far). I know
> that I need to redesign the table to avoid such large partitions. What
> specifically goes wrong that results in the instability I am seeing? Or put
> another way, what issues will compacting really large partitions cause?
> Initially I thought that there was high GC activity, but after closer
> inspection that does not really seem to happening. And most of the failures
> I am seeing are on reads, but for an entirely different table. Lastly, does
> anyone has anyone had success to switching to STCS in this situation as a
> work around?
>
> Thanks
>
> - John
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Cassandra Golang Driver and Support

2016-04-14 Thread Dan Kinder
Just want to put a plug in for gocql and the guys who work on it. I use it
for production applications that sustain ~10,000 writes/sec on an 8-node
cluster, and in the few times I have seen problems they have been responsive
on issues and pull requests. Once or twice I have seen the API change, but
otherwise it has been stable. In general I have found it very intuitive to
use and easy to configure.
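
To give a flavor, a minimal gocql program looks something like this (the
hosts, keyspace, and table/columns are placeholders):

package main

import (
    "log"

    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("10.0.0.1", "10.0.0.2")
    cluster.Keyspace = "myks"
    cluster.Consistency = gocql.Quorum
    session, err := cluster.CreateSession()
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // Simple parameterized write; gocql prepares the statement for you.
    if err := session.Query(`INSERT INTO events (id, payload) VALUES (?, ?)`,
        gocql.TimeUUID(), "hello").Exec(); err != nil {
        log.Fatal(err)
    }
}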

On Thu, Apr 14, 2016 at 2:30 PM, Yawei Li  wrote:

> Thanks for the info, Bryan!
> We are in general assess the support level of GoCQL v.s Java Driver. From
> http://gocql.github.io/, looks like it is a WIP (some TODO items, api is
> subject to change)? And https://github.com/gocql/gocql suggests the
> performance may degrade now and then, and the supported versions are up to
> 2.2.x? For us maintaining two stacks (Java and Go) may be expensive so I am
> checking what's the general strategy folks are using here.
>
> On Wed, Apr 13, 2016 at 11:31 AM, Bryan Cheng 
> wrote:
>
>> Hi Yawei,
>>
>> While you're right that there's no first-party driver, we've had good
>> luck using gocql (https://github.com/gocql/gocql) in production at
>> moderate scale. What features in particular are you looking for that are
>> missing?
>>
>> --Bryan
>>
>> On Tue, Apr 12, 2016 at 10:06 PM, Yawei Li  wrote:
>>
>>> Hi,
>>>
>>> It looks like to me that DataStax doesn't provide official golang driver
>>> yet and the goland client libs are overall lagging behind the Java driver
>>> in terms of feature set, supported version and possibly production
>>> stability?
>>>
>>> We are going to support a large number of services  in both Java and Go.
>>> if the above impression is largely true, we are considering the option of
>>> focusing on Java client and having GoLang program talk to the Java service
>>> via RPC for data access. Anyone has tried similar approach?
>>>
>>> Thanks
>>>
>>
>>


Re: MemtableReclaimMemory pending building up

2016-03-08 Thread Dan Kinder
Quick follow-up here: so far I've had these nodes stable for about 2 days
now with the following (still mysterious) solution: *increase*
memtable_heap_space_in_mb to 20GB. This was having issues at the default
value of 1/4 of the heap (12GB in my case; I misspoke earlier and said 16GB).
Upping it to 20GB seems to have made the issue go away so far.

Best guess now is that it simply was memtable flush throughput. Playing
with memtable_cleanup_threshold further may have also helped, but I didn't
want to create small SSTables.

Thanks again for the input @Alain.

On Fri, Mar 4, 2016 at 4:53 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi thanks for responding Alain. Going to provide more info inline.
>
> However a small update that is probably relevant: while the node was in
> this state (MemtableReclaimMemory building up), since this cluster is not
> serving live traffic I temporarily turned off ALL client traffic, and the
> node still never recovered, MemtableReclaimMemory never went down. Seems
> like there is one thread doing this reclaiming and it has gotten stuck
> somehow.
>
> Will let you know when I have more results from experimenting... but
> again, merci
>
> On Thu, Mar 3, 2016 at 2:32 AM, Alain RODRIGUEZ <arodr...@gmail.com>
> wrote:
>
>> Hi Dan,
>>
>> I'll try to go through all the elements:
>>
>> seeing this odd behavior happen, seemingly to single nodes at a time
>>
>>
>> Is that one node at the time or always on the same node. Do you consider
>> your data model if fairly, evenly distributed ?
>>
>
> of 6 nodes, 2 of them seem to be the recurring culprits. Could be related
> to a particular data partition.
>
>
>>
>> The node starts to take more and more memory (instance has 48GB memory on
>>> G1GC)
>>
>>
>> Do you use 48 GB heap size or is that the total amount of memory in the
>> node ? Could we have your JVM settings (GC and heap sizes), also memtable
>> size and type (off heap?) and the amount of available memory ?
>>
>
> Machine spec: 24 virtual cores, 64GB memory, 12 HDD JBOD (yes an absurd
> number of disks, not my choice)
>
> memtable_heap_space_in_mb: 10240 # 10GB (previously left as default which
> was 16GB and caused the issue more frequently)
> memtable_allocation_type: heap_buffers
> memtable_flush_writers: 12
>
> MAX_HEAP_SIZE="48G"
> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
> JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>
>>
>> Note that there is a decent number of compactions going on as well but
>>> that is expected on these nodes and this particular one is catching up from
>>> a high volume of writes
>>>
>>
>> Are the *concurrent_compactors* correctly throttled (about 8 with good
>> machines) and the *compaction_throughput_mb_per_sec* high enough to cope
>> with what is thrown at the node ? Using SSD I often see the latter
>> unthrottled (using 0 value), but I would try small increments first.
>>
> concurrent_compactors: 12
> compaction_throughput_mb_per_sec: 0
>
>>
>> Also interestingly, neither CPU nor disk utilization are pegged while
>>> this is going on
>>>
>>
>> First thing is making sure your memory management is fine. Having
>> information about the JVM and memory usage globally would help. Then, if
>> you are not fully using the resources you might want to try increasing the
>> number of *concurrent_writes* to a higher value (probably a way higher,
>> given the pending requests, but go safely, incrementally, first on a canary
>> node) and monitor tpstats + resources. Hope this will help Mutation pending
>> going down. My guess is that pending requests are messing with the JVM, but
>> it could be the exact contrary as well.
>>
> concurrent_writes: 192
> It may be worth noting that the main reads going on are large batch reads,
> while these writes are happening (akin to analytics jobs).
>
> I'm going to look into JVM use a bit more but otherwise it seems like
> normal Young generation GCs are happening even as this problem surfaces.
>
>
>>
>> Native-Transport-Requests25 0  547935519 0
>>> 2586907
>>
>>
>> About Native requests being blocked, you can probably mitigate things by
>> increasing the native_transport_max_threads: 128 (try to double it and
>> continue tuning incrementally). 

Re: MemtableReclaimMemory pending building up

2016-03-04 Thread Dan Kinder
perations = high memory pressure.
> Reducing pending stuff somehow will probably get you out of trouble.
>
> Hope this first round of ideas will help you.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 22:58 GMT+01:00 Dan Kinder <dkin...@turnitin.com>:
>
>> Also should note: Cassandra 2.2.5, Centos 6.7
>>
>> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>>
>>> Hi y'all,
>>>
>>> I am writing to a cluster fairly fast and seeing this odd behavior
>>> happen, seemingly to single nodes at a time. The node starts to take more
>>> and more memory (instance has 48GB memory on G1GC). tpstats shows that
>>> MemtableReclaimMemory Pending starts to grow first, then later
>>> MutationStage builds up as well. By then most of the memory is being
>>> consumed, GC is getting longer, node slows down and everything slows down
>>> unless I kill the node. Also the number of Active MemtableReclaimMemory
>>> threads seems to stay at 1. Also interestingly, neither CPU nor disk
>>> utilization are pegged while this is going on; it's on jbod and there is
>>> plenty of headroom there. (Note that there is a decent number of
>>> compactions going on as well but that is expected on these nodes and this
>>> particular one is catching up from a high volume of writes).
>>>
>>> Anyone have any theories on why this would be happening?
>>>
>>>
>>> $ nodetool tpstats
>>> Pool NameActive   Pending  Completed   Blocked
>>>  All time blocked
>>> MutationStage   192715481  311327142 0
>>>   0
>>> ReadStage 7 09142871 0
>>>   0
>>> RequestResponseStage  1 0  690823199 0
>>>   0
>>> ReadRepairStage   0 02145627 0
>>>   0
>>> CounterMutationStage  0 0  0 0
>>>   0
>>> HintedHandoff 0 0144 0
>>>   0
>>> MiscStage 0 0  0 0
>>>   0
>>> CompactionExecutor   1224  41022 0
>>>   0
>>> MemtableReclaimMemory 1   102   4263 0
>>>   0
>>> PendingRangeCalculator0 0 10 0
>>>   0
>>> GossipStage   0 0 148329 0
>>>   0
>>> MigrationStage0 0  0 0
>>>   0
>>> MemtablePostFlush 0 0   5233 0
>>>   0
>>> ValidationExecutor0 0  0 0
>>>   0
>>> Sampler   0     0  0 0
>>>   0
>>> MemtableFlushWriter   0 0   4270 0
>>>   0
>>> InternalResponseStage 0 0   16322698 0
>>>   0
>>> AntiEntropyStage  0 0  0 0
>>>   0
>>> CacheCleanupExecutor  0 0  0 0
>>>   0
>>> Native-Transport-Requests25 0  547935519 0
>>> 2586907
>>>
>>> Message type   Dropped
>>> READ 0
>>> RANGE_SLICE  0
>>> _TRACE   0
>>> MUTATION287057
>>> COUNTER_MUTATION 0
>>> REQUEST_RESPONSE 0
>>> PAGED_RANGE  0
>>> READ_REPAIR149
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Principal Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>


Re: MemtableReclaimMemory pending building up

2016-03-02 Thread Dan Kinder
Also should note: Cassandra 2.2.5, Centos 6.7

On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder <dkin...@turnitin.com> wrote:

> Hi y'all,
>
> I am writing to a cluster fairly fast and seeing this odd behavior happen,
> seemingly to single nodes at a time. The node starts to take more and more
> memory (instance has 48GB memory on G1GC). tpstats shows that
> MemtableReclaimMemory Pending starts to grow first, then later
> MutationStage builds up as well. By then most of the memory is being
> consumed, GC is getting longer, node slows down and everything slows down
> unless I kill the node. Also the number of Active MemtableReclaimMemory
> threads seems to stay at 1. Also interestingly, neither CPU nor disk
> utilization are pegged while this is going on; it's on jbod and there is
> plenty of headroom there. (Note that there is a decent number of
> compactions going on as well but that is expected on these nodes and this
> particular one is catching up from a high volume of writes).
>
> Anyone have any theories on why this would be happening?
>
>
> $ nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
>  All time blocked
> MutationStage   192715481  311327142 0
> 0
> ReadStage 7 09142871 0
> 0
> RequestResponseStage  1 0  690823199 0
> 0
> ReadRepairStage   0 02145627 0
> 0
> CounterMutationStage  0 0  0 0
> 0
> HintedHandoff 0 0144 0
> 0
> MiscStage 0 0  0 0
> 0
> CompactionExecutor   1224  41022 0
> 0
> MemtableReclaimMemory 1   102   4263 0
> 0
> PendingRangeCalculator0 0 10 0
> 0
> GossipStage   0 0 148329 0
> 0
> MigrationStage0 0  0 0
> 0
> MemtablePostFlush 0 0   5233 0
> 0
> ValidationExecutor0 0  0 0
> 0
> Sampler   0 0  0 0
> 0
> MemtableFlushWriter   0 0   4270 0
> 0
> InternalResponseStage 0 0   16322698 0
> 0
> AntiEntropyStage  0 0  0 0
> 0
> CacheCleanupExecutor  0 0  0 0
> 0
> Native-Transport-Requests25 0  547935519 0
>   2586907
>
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION287057
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR149
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


MemtableReclaimMemory pending building up

2016-03-02 Thread Dan Kinder
Hi y'all,

I am writing to a cluster fairly fast and seeing this odd behavior happen,
seemingly to single nodes at a time. The node starts to take more and more
memory (instance has 48GB memory on G1GC). tpstats shows that
MemtableReclaimMemory Pending starts to grow first, then later
MutationStage builds up as well. By then most of the memory is being
consumed, GC is getting longer, node slows down and everything slows down
unless I kill the node. Also the number of Active MemtableReclaimMemory
threads seems to stay at 1. Also interestingly, neither CPU nor disk
utilization are pegged while this is going on; it's on jbod and there is
plenty of headroom there. (Note that there is a decent number of
compactions going on as well but that is expected on these nodes and this
particular one is catching up from a high volume of writes).

Anyone have any theories on why this would be happening?


$ nodetool tpstats
Pool Name                         Active   Pending   Completed   Blocked   All time blocked
MutationStage                        192    715481   311327142         0                  0
ReadStage                              7         0     9142871         0                  0
RequestResponseStage                   1         0   690823199         0                  0
ReadRepairStage                        0         0     2145627         0                  0
CounterMutationStage                   0         0           0         0                  0
HintedHandoff                          0         0         144         0                  0
MiscStage                              0         0           0         0                  0
CompactionExecutor                    12        24       41022         0                  0
MemtableReclaimMemory                  1       102        4263         0                  0
PendingRangeCalculator                 0         0          10         0                  0
GossipStage                            0         0      148329         0                  0
MigrationStage                         0         0           0         0                  0
MemtablePostFlush                      0         0        5233         0                  0
ValidationExecutor                     0         0           0         0                  0
Sampler                                0         0           0         0                  0
MemtableFlushWriter                    0         0        4270         0                  0
InternalResponseStage                  0         0    16322698         0                  0
AntiEntropyStage                       0         0           0         0                  0
CacheCleanupExecutor                   0         0           0         0                  0
Native-Transport-Requests             25         0   547935519         0            2586907

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
MUTATION287057
COUNTER_MUTATION 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR149


Re: Production with Single Node

2016-01-22 Thread Dan Kinder
I could see this being desirable if you are deploying the exact same
application as you deploy in other places with many nodes, and you know the
load will be low. It may be a rare situation, but in such a case you save a
lot of effort by not having to change your application logic.

Not that I necessarily recommend it, but to answer John's question: my
understanding is that if you want to keep it snappy and low-latency you
should watch out for GC pauses and consider your GC tuning carefully, since
on a single node a pause will cause the whole show to stop. Presumably your
load won't be very high.

Also, if you are concerned with durability, you may want to consider changing
commitlog_sync
<https://docs.datastax.com/en/cassandra/1.2/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__commitlog_sync>
to batch. I believe this is the only way to guarantee write durability with
one node. Again with the performance caveat: under high load it could cause
problems.
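
(For reference, the relevant cassandra.yaml settings would look something
like this; the 2 ms window is only an example value:)

commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2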

On Fri, Jan 22, 2016 at 12:34 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> My opinion:
> http://rustyrazorblade.com/2013/09/cassandra-faq-can-i-start-with-a-single-node/
>
> TL;DR: the only reason to run 1 node in prod is if you're super broke but
> know you'll need to scale up almost immediately after going to prod (maybe
> after getting some funding).
>
> If you're planning on doing it as a more permanent solution, you've chosen
> the wrong database.
>
> On Fri, Jan 22, 2016 at 12:30 PM Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>> The risks would be about the same as with a single-node Postgres or MySQL
>> database, except that you wouldn't have the benefit of full SQL.
>>
>> How much data (rows, columns), what kind of load pattern (heavy write,
>> heavy update, heavy query), and what types of queries (primary key-only,
>> slices, filtering, secondary indexes, etc.)?
>>
>> -- Jack Krupansky
>>
>> On Fri, Jan 22, 2016 at 3:24 PM, John Lammers <
>> john.lamm...@karoshealth.com> wrote:
>>
>>> After deploying a number of production systems with up to 10 Cassandra
>>> nodes each, we are looking at deploying a small, all-in-one-server system
>>> with only a single, local node (Cassandra 2.1.11).
>>>
>>> What are the risks of such a configuration?
>>>
>>> The virtual disk would be running RAID 5 and the disk controller would
>>> have a flash backed write-behind cache.
>>>
>>> What's the best way to configure Cassandra and/or respecify the hardware
>>> for an all-in-one-box solution?
>>>
>>> Thanks-in-advance!
>>>
>>> --John
>>>
>>>
>>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: compression cpu overhead

2015-11-04 Thread Dan Kinder
To clarify: writes have no *immediate* CPU cost from adding the write to
the memtable; however, the compression overhead is paid when writing
out a new SSTable (whether from flushing a memtable or compacting), correct?

So it sounds like when reads >> writes, Tushar's comments are accurate,
but for a high-write workload, flushing and compactions would create most of
the overhead.

On Tue, Nov 3, 2015 at 6:03 PM, Jon Haddad <jonathan.had...@gmail.com>
wrote:

> You won't see any overhead on writes because you don't actually write to
> sstables when performing a write.  Just the commit log & memtable.
> Memtables are flushes asynchronously.
>
> On Nov 4, 2015, at 1:57 AM, Tushar Agrawal <agrawal.tus...@gmail.com>
> wrote:
>
> For writes it's negligible. For reads it makes a significant difference
> for high tps and low latency workload. You would see up to 3x higher cpu
> with LZ4 vs no compression. It would be different for different h/w
> configurations.
>
>
> Thanks,
> Tushar
> (Sent from iPhone)
>
> On Nov 3, 2015, at 5:51 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>
Most concerned about writes since that's where most of the cost is, but
perf numbers for any workload mix would be helpful.
>
> On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson <gra...@vast.com> wrote:
>
>> On read or write?
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2
>> should make some difference, I didn’t immediately find perf numbers though.
>>
>> On Nov 3, 2015, at 5:42 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>>
>> Hey all,
>>
>> Just wondering if anyone has seen or done any benchmarking of the
>> actual CPU overhead added by various compression algorithms in Cassandra
>> (at least LZ4) vs no compression. Clearly this is going to be workload
>> dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
>> compression increases my CPU load by ~2x")
>>
>> -dan
>>
>>
>>
>
>
> --
> Dan Kinder
> Senior Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


compression cpu overhead

2015-11-03 Thread Dan Kinder
Hey all,

Just wondering if anyone has seen or done any benchmarking of the
actual CPU overhead added by various compression algorithms in Cassandra
(at least LZ4) vs no compression. Clearly this is going to be workload
dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
compression increases my CPU load by ~2x")

-dan


Re: compression cpu overhead

2015-11-03 Thread Dan Kinder
Most concerned about writes since that's where most of the cost is, but perf
numbers for any workload mix would be helpful.

On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson <gra...@vast.com> wrote:

> On read or write?
>
> https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2
> should make some difference, I didn’t immediately find perf numbers though.
>
> On Nov 3, 2015, at 5:42 PM, Dan Kinder <dkin...@turnitin.com> wrote:
>
> Hey all,
>
> Just wondering if anyone has seen or done any benchmarking of the
> actual CPU overhead added by various compression algorithms in Cassandra
> (at least LZ4) vs no compression. Clearly this is going to be workload
> dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
> compression increases my CPU load by ~2x")
>
> -dan
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: memtable flush size with LCS

2015-11-02 Thread Dan Kinder
@Jeff Jirsa thanks, the memtable_* keys were the actual determining factor
for my memtable flushes; they are what I needed to play with.
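
For anyone else tuning the same thing, the cassandra.yaml keys in question
look roughly like this (values are only illustrative, not recommendations):

memtable_allocation_type: heap_buffers
memtable_heap_space_in_mb: 4096
memtable_offheap_space_in_mb: 4096
# defaults to 1 / (memtable_flush_writers + 1) if not set
memtable_cleanup_threshold: 0.11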

On Thu, Oct 29, 2015 at 8:23 AM, Ken Hancock <ken.hanc...@schange.com>
wrote:

> Or if you're doing a high volume of writes, then your flushed file size
> may be completely determined by other CFs that have consumed the commitlog
> size, forcing any memtables whose commitlog is being delete to be forced to
> disk.
>
>
> On Wed, Oct 28, 2015 at 2:51 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> wrote:
>
>> It’s worth mentioning that initial flushed file size is typically
>> determined by memtable_cleanup_threshold and the memtable space options
>> (memtable_heap_space_in_mb, memtable_offheap_space_in_mb, depending on
>> memtable_allocation_type)
>>
>>
>>
>> From: Nate McCall
>> Reply-To: "user@cassandra.apache.org"
>> Date: Wednesday, October 28, 2015 at 11:45 AM
>> To: Cassandra Users
>> Subject: Re: memtable flush size with LCS
>>
>>
>>  do you mean that this property is ignored at memtable flush time, and so
>>> memtables are already allowed to be much larger than sstable_size_in_mb?
>>>
>>
>> Yes, 'sstable_size_in_mb' plays no part in the flush process. Flushing
>> is based on solely on runtime activity and the file size is determined by
>> whatever was in the memtable at that time.
>>
>>
>>
>> --
>> -
>> Nate McCall
>> Austin, TX
>> @zznate
>>
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


memtable flush size with LCS

2015-10-27 Thread Dan Kinder
Hi all,

The docs indicate that memtables are triggered to flush when data in the
commitlog is expiring or based on memtable_flush_period_in_ms.

But LCS has a specified sstable size; when using LCS are memtables flushed
when they hit the desired sstable size (default 160MB) or could L0 sstables
be much larger than that?

Wondering because I have an overwrite workload where larger memtables would
be helpful, and if I need to increase my LCS sstable size in order to allow
for that.
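
(For context, the size I'm referring to is the per-table compaction option,
e.g. something like the following; the table name is a placeholder and 320MB
is just an example:)

ALTER TABLE myks.mytable WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 320
};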

-dan


Re: memtable flush size with LCS

2015-10-27 Thread Dan Kinder
Thanks, I am using most of the suggested parameters to tune compactions. To
clarify, when you say "The sstable_size_in_mb can be thought of as a target
for the compaction process moving the file beyond L0," do you mean that
this property is ignored at memtable flush time, and so memtables are
already allowed to be much larger than sstable_size_in_mb?

On Tue, Oct 27, 2015 at 2:57 PM, Nate McCall <n...@thelastpickle.com> wrote:

> The sstable_size_in_mb can be thought of as a target for the compaction
> process moving the file beyond L0.
>
> Note: If there are more than 32 SSTables in L0, it will switch over to
> doing STCS for L0 (you can disable this behavior by passing
> -Dcassandra.disable_stcs_in_l0=true as a system property).
>
> With a lot of overwrites, the settings you want to tune will be
> gc_grace_seconds in combination with tombstone_threhsold,
> tombstone_compaction_interval and maybe unchecked_tombstone_compaction
> (there are different opinions about this last one, YMMV). Making these more
> aggressive and increasing your sstable_size_in_mb will allow for
> potentially capturing more overwrites in a level which will lead to less
> fragmentation. However, making the size too large will keep compaction from
> triggering on further out levels which can then exacerbate problems
> particulary if you have long-lived TTLs.
>
> In general, it is very workload specific, but monitoring the histogram for
> the number of ssables used in a read (via
> org.apache.cassandra.metrics.ColumnFamily.$KEYSPACE.$TABLE.SSTablesPerReadHistogram.95percentile
> or shown manually in nodetool cfhistograms output) after any change will
> help you narrow in a good setting.
>
> See
> http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html?scroll=compactSubprop__compactionSubpropertiesLCS
> for more details.
>
> On Tue, Oct 27, 2015 at 3:42 PM, Dan Kinder <dkin...@turnitin.com> wrote:
> >
> > Hi all,
> >
> > The docs indicate that memtables are triggered to flush when data in the
> commitlog is expiring or based on memtable_flush_period_in_ms.
> >
> > But LCS has a specified sstable size; when using LCS are memtables
> flushed when they hit the desired sstable size (default 160MB) or could L0
> sstables be much larger than that?
> >
> > Wondering because I have an overwrite workload where larger memtables
> would be helpful, and if I need to increase my LCS sstable size in order to
> allow for that.
> >
> > -dan
>
>
>
>
> --
> -
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


future very wide row support

2015-08-31 Thread Dan Kinder
Hi,

My understanding is that wide row support (i.e. many columns/CQL-rows/cells
per partition key) has gotten much better in the past few years; even
though the theoretical limit of 2 billion has been much higher than the
practical one for a long time, it seems like now Cassandra is able to handle
these better (e.g. incremental compactions so Cassandra doesn't OOM).

So I'm wondering:

   - With more recent improvements (say, including up to 2.2 or maybe 3.0),
   is the practical limit still much lower than 2 billion? Do we have any idea
   what limits us in this regard? (Maybe repair is still another bottleneck?)
   - Is the 2 billion limit a SSTable limitation?
   https://issues.apache.org/jira/browse/CASSANDRA-7447 seems to indicate
   that it might be. Is there any future work we think will increase this
   limit?

A couple of caveats:

I am aware that even if such a large partition is possible it may not
usually be practical, because it works against Cassandra's primary feature
of sharding data to multiple nodes and parallelizing access. However some
analytics/batch-processing use-cases could benefit from the guarantee that
a certain set of data is together on a node. It can also make certain data
modeling situations a bit easier, where currently we just need to model
around the limitation. Also, 2 billion rows of small columns only adds up
to data in the tens of gigabytes, and the use of larger nodes these days
means that practically one node could hold much larger partitions. And
lastly, there are just cases where 99.999% of partition keys are going to be
pretty small, but there are potential outliers that could be very large; it
would be great for Cassandra to handle these even if it is suboptimal,
helping us all avoid having to model around such exceptions.

Well, this turned into something of an essay... thanks for reading and glad
to receive input on this.


Re: Overwhelming tombstones with LCS

2015-07-10 Thread Dan Kinder
On Sun, Jul 5, 2015 at 1:40 PM, Roman Tkachenko ro...@mailgunhq.com wrote:

 Hey guys,

 I have a table with RF=3 and LCS. Data model makes use of wide rows. A
 certain query run against this table times out and tracing reveals the
 following error on two out of three nodes:

 *Scanned over 10 tombstones; query aborted (see
 tombstone_failure_threshold)*

 This basically means every request with CL higher than one fails.

 I have two questions:

 * How could it happen that only two out of three nodes have overwhelming
 tombstones? For the third node tracing shows sensible *Read 815 live and
 837 tombstoned cells* traces.


One theory: before 2.1.6 compactions on wide rows with lots of tombstones
could take forever or potentially never finish. What version of Cassandra
are you on? It may be that you got lucky with one node that has been able
to keep up but the others haven't been able to.



 * Anything I can do to fix those two nodes? I have already set gc_grace to
 1 day and tried to make compaction strategy more aggressive
 (unchecked_tombstone_compaction - true, tombstone_threshold - 0.01) to no
 avail - a couple of days have already passed and it still gives the same
 error.


You probably want major compaction, which is coming soon for LCS (
https://issues.apache.org/jira/browse/CASSANDRA-7272) but isn't here yet.

The alternative, if you have enough time and headroom (this is going to
do some pretty serious compaction, so be careful), is to alter your table to
STCS, let it compact into one SSTable, then convert back to LCS. It's pretty
heavy-handed, but as long as your gc_grace is low enough it'll do the job.
Definitely do NOT do this if you have many tombstones in single wide rows
and are not on 2.1.6 or later.
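
(Roughly, that switch is just two ALTER TABLE statements plus a major
compaction in between; the table name below is a placeholder:)

ALTER TABLE myks.mytable WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
-- run a major compaction (nodetool compact myks mytable), wait for it to finish, then:
ALTER TABLE myks.mytable WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};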



 Thanks!

 Roman




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Commitlog still replaying after drain shutdown

2015-06-30 Thread Dan Kinder
Hi all,

To quote Sebastian Estevez in one recent thread: "You said you ran a
nodetool drain before the restart, but your logs show commitlogs replayed.
That does not add up..." The docs seem to generally agree with this: if you
did `nodetool drain` before restarting your node, there shouldn't be any
commitlogs left to replay.

But my experience has been that if I do `nodetool drain`, I need to wait at
least 30-60 seconds after it has finished if I really want no commitlog
replay on restart. If I restart immediately (or even 10-20s later) then it
replays plenty. (This was true on 2.X and is still true on 2.1.7 for me.)

Is this unusual or the same thing others see? Is `nodetool drain` really
supposed to wait until all memtables are flushed and commitlogs are deleted
before it returns?
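
(For now my workaround is roughly the following; the paths assume a default
package install, and I'm not certain every version logs the DRAINED message:)

nodetool drain
# wait for the DRAINED message to appear before restarting
grep DRAINED /var/log/cassandra/system.log | tail -1
# sanity-check that the commitlog directory is empty
ls /var/lib/cassandra/commitlog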

Thanks,
-dan


Re: counters still inconsistent after repair

2015-06-19 Thread Dan Kinder
Thanks Rob, this was helpful.

More counters will be added soon; I'll let you know if those have any
problems.
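
For anyone scripting step 3 of the process Rob describes below, a rough
sketch (reusing the keyspace/table from my earlier example; the host list is
hypothetical, and this assumes that pointing cqlsh at a replica is enough to
make that replica apply the write locally):

# issue the +0 UPDATE against each replica of the key, one host at a time
for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  cqlsh "$host" -e "UPDATE walker.counter_table SET value = value + 0 WHERE field = 'test';"
done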

On Mon, Jun 15, 2015 at 4:32 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Jun 15, 2015 at 2:52 PM, Dan Kinder dkin...@turnitin.com wrote:

 Potentially relevant facts:
 - Recently upgraded to 2.1.6 from 2.0.14
 - This table has ~million rows, low contention, and fairly high increment
 rate

 Can you repro on a counter that was created after the upgrade?

 Mainly wondering:

 - Is this known or expected? I know Cassandra counters have had issues
 but thought by now it should be able to keep a consistent counter or at
 least repair it...

 All counters which haven't been written to after 2.1 new counters are
 still on disk as old counters and will remain that way until UPDATEd and
 then compacted together with all old shards. Old counters can exhibit
 this behavior.

 - Any way to reset this counter?

 Per Aleksey (in IRC) you can turn a replica for an old counter into a new
 counter by UPDATEing it once.

 In order to do that without modifying the count, you can [1] :

 UPDATE tablename SET countercolumn = countercolumn +0 where id = 1;

 The important caveat is that this must be done at least once per shard, with
 one shard per RF. The only way one can be sure that all shards have been
 UPDATEd is by contacting each replica node and doing the UPDATE + 0 there,
 because local writes are preferred.

 To summarize, the optimal process to upgrade your pre-existing counters to
 2.1-era new counters :

 1) get a list of all counter keys
 2) get a list of replicas per counter key
 3) connect to each replica for each counter key and issue an UPDATE + 0
 for that counter key
 4) run a major compaction

 As an aside, Aleksey suggests that the above process is so heavyweight
 that it may not be worth it. If you just leave them be, all counters you're
 actually using will become progressively more accurate over time.

 =Rob
 [1] Special thanks to Jeff Jirsa for verifying that this syntax works.
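
A rough gocql sketch of step 3, assuming a gocql version that supports
WhiteListHostFilter (the replica addresses below are placeholders you would
get from `nodetool getendpoints`; the keyspace, table, and key are the ones
from this thread):

package main

import "github.com/gocql/gocql"

func main() {
    // Replica addresses for the counter key, e.g. from
    // `nodetool getendpoints walker counter_table test`.
    replicas := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}

    for _, host := range replicas {
        cf := gocql.NewCluster(host)
        // Pin the driver to this one replica so the no-op increment is
        // coordinated (and written) locally on that node.
        cf.HostFilter = gocql.WhiteListHostFilter(host)
        db, err := cf.CreateSession()
        if err != nil {
            panic(err)
        }
        err = db.Query(
            "UPDATE walker.counter_table SET value = value + 0 WHERE field = ?",
            "test",
        ).Exec()
        db.Close()
        if err != nil {
            panic(err)
        }
    }
}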




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


counters still inconsistent after repair

2015-06-15 Thread Dan Kinder
Currently on 2.1.6 I'm seeing behavior like the following:

cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
 test  |    30
(1 rows)
cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
 test  |    90
(1 rows)
cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
 test  |    30
(1 rows)

Using tracing I can see that one node has wrong data. However running
repair on this table does not seem to have done anything, I still see the
wrong value returned from this same node.

Potentially relevant facts:
- Recently upgraded to 2.1.6 from 2.0.14
- This table has ~million rows, low contention, and fairly high increment
rate

Mainly wondering:
- Is this known or expected? I know Cassandra counters have had issues but
thought by now it should be able to keep a consistent counter or at least
repair it...
- Any way to reset this counter?
- Any other stuff I can check?


Re: Multiple cassandra instances per physical node

2015-05-21 Thread Dan Kinder
@James Rothering yeah I was thinking of container in a broad sense: either
full virtual machines, docker containers, straight LXC, or whatever else
would allow the Cassandra nodes to have their own IPs and bind to default
ports.

@Jonathan Haddad thanks for the blog post. To ensure the same host does not
replicate its own data, would I basically need the nodes on a single host
to be labeled as one rack? (Assuming I use vnodes)

On Thu, May 21, 2015 at 1:02 PM, Sebastian Estevez 
sebastian.este...@datastax.com wrote:

 JBOD -- just a bunch of disks, no raid.

 All the best,


 Sebastián Estévez

 Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com


 On Thu, May 21, 2015 at 4:00 PM, James Rothering jrother...@codojo.me
 wrote:

 Hmmm ... Not familiar with JBOD. Is that just RAID-0?

 Also ... wrt  the container talk, is that a Docker container you're
 talking about?



 On Thu, May 21, 2015 at 12:48 PM, Jonathan Haddad j...@jonhaddad.com
 wrote:

 If you run it in a container with dedicated IPs it'll work just fine.
 Just be sure you aren't using the same machine to replicate its own data.

 On Thu, May 21, 2015 at 12:43 PM Manoj Khangaonkar 
 khangaon...@gmail.com wrote:

 +1.

 I agree we need to be able to run multiple server instances on one
 physical machine. This is especially necessary in development and test
 environments where one is experimenting and needs a cluster, but do not
 have access to multiple physical machines.

 If you google, you can find a few blogs that talk about how to do
 this.

 But it is less than ideal. We need to be able to do it by changing
 ports in cassandra.yaml (the way it is done easily with Hadoop or Apache
 Kafka or Redis and many other distributed systems).


 regards



 On Thu, May 21, 2015 at 10:32 AM, Dan Kinder dkin...@turnitin.com
 wrote:

 Hi, I'd just like some clarity and advice regarding running multiple
 cassandra instances on a single large machine (big JBOD array, plenty of
 CPU/RAM).

 First, I am aware this was not Cassandra's original design, and doing
 this seems to unreasonably go against the commodity hardware intentions
 of Cassandra's design. In general it seems to be recommended against (at
 least as far as I've heard from @Rob Coli and others).

 However maybe this term commodity is changing... my hardware/ops
 team argues that due to cooling, power, and other datacenter costs, having
 slightly larger nodes (>=32G RAM, >=24 CPU, >=8 disks JBOD) is actually a
 better price point. Now, I am not a hardware guy, so if this is not
 actually true I'd love to hear why, otherwise I pretty much need to take
 them at their word.

 Now, Cassandra features seem to have improved such that JBOD works
 fairly well, but especially with memory/GC this seems to be reaching its
 limit. One Cassandra instance can only scale up so much.

 So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra
 nodes (each with 5 data disks, 1 commit log disk) and either give each its
 own container & IP or change the listen ports. Will this work? What are
 the risks? Will/should Cassandra support this better in the future?




 --
 http://khangaonkar.blogspot.com/






-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Multiple cassandra instances per physical node

2015-05-21 Thread Dan Kinder
Hi, I'd just like some clarity and advice regarding running multiple
cassandra instances on a single large machine (big JBOD array, plenty of
CPU/RAM).

First, I am aware this was not Cassandra's original design, and doing this
seems to unreasonably go against the commodity hardware intentions of
Cassandra's design. In general it seems to be recommended against (at least
as far as I've heard from @Rob Coli and others).

However maybe this term commodity is changing... my hardware/ops team
argues that due to cooling, power, and other datacenter costs, having
slightly larger nodes (>=32G RAM, >=24 CPU, >=8 disks JBOD) is actually a
better price point. Now, I am not a hardware guy, so if this is not
actually true I'd love to hear why, otherwise I pretty much need to take
them at their word.

Now, Cassandra features seem to have improved such that JBOD works fairly
well, but especially with memory/GC this seems to be reaching its limit.
One Cassandra instance can only scale up so much.

So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra nodes
(each with 5 data disks, 1 commit log disk) and either give each its own
container & IP or change the listen ports. Will this work? What are the
risks? Will/should Cassandra support this better in the future?


Delete query range limitation

2015-04-15 Thread Dan Kinder
I understand that range deletes are currently not supported (
http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key
)

Since Cassandra now does have range tombstones, is there a reason why this
can't be allowed? Is there a ticket for supporting this, or is it a
deliberate design decision not to?
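
In the meantime the workaround seems to be reading back the clustering keys
in the range and deleting them one at a time. A rough gocql sketch, with a
completely made-up schema, just to illustrate the shape of it:

package main

import (
    "time"

    "github.com/gocql/gocql"
)

func main() {
    cf := gocql.NewCluster("localhost")
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Hypothetical schema:
    //   CREATE TABLE ks.events (id text, ts timestamp, payload text, PRIMARY KEY (id, ts))
    cutoff := time.Now().AddDate(0, -1, 0) // drop everything older than a month
    iter := db.Query(
        "SELECT ts FROM ks.events WHERE id = ? AND ts < ?", "some-id", cutoff,
    ).PageSize(1000).Iter()

    var ts time.Time
    for iter.Scan(&ts) {
        // One tombstone per row; this is exactly what a real range delete would avoid.
        if err := db.Query(
            "DELETE FROM ks.events WHERE id = ? AND ts = ?", "some-id", ts,
        ).Exec(); err != nil {
            panic(err)
        }
    }
    if err := iter.Close(); err != nil {
        panic(err)
    }
}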


Finding nodes that own a given token/partition key

2015-03-26 Thread Dan Kinder
Hey all,

In certain cases it would be useful for us to find out which node(s) have
the data for a given token/partition key.

The only solution I'm aware of is to select from system.local and/or
system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM
myks.mytable WHERE thing = 'value';`, then do the math (put the ring
together) and figure out which node has the data. I'm assuming this is what
token aware drivers are doing.

Is there a simpler way to do this?

A bit more context: we'd like to move some processing closer to data, but
for a few reasons hadoop/spark aren't good options for the moment.


Re: Finding nodes that own a given token/partition key

2015-03-26 Thread Dan Kinder
Thanks guys, think both of these answer my question. Guess I had overlooked
nodetool getendpoints. Hopefully findable by future googlers now.

On Thu, Mar 26, 2015 at 2:37 PM, Adam Holmberg adam.holmb...@datastax.com
wrote:

 Dan,

 Depending on your context, many of the DataStax drivers have the token
 ring exposed client-side.

 For example,
 Python:
 http://datastax.github.io/python-driver/api/cassandra/metadata.html#tokens-and-ring-topology
 Java:
 http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/Metadata.html

 You may not have to construct this yourself.

 Adam Holmberg

 On Thu, Mar 26, 2015 at 3:53 PM, Roman Tkachenko ro...@mailgunhq.com
 wrote:

 Hi Dan,

 Have you tried using nodetool getendpoints? It shows you nodes that
 currently own the specific key.

 Roman

 On Thu, Mar 26, 2015 at 1:21 PM, Dan Kinder dkin...@turnitin.com wrote:

 Hey all,

 In certain cases it would be useful for us to find out which node(s)
 have the data for a given token/partition key.

 The only solution I'm aware of is to select from system.local and/or
 system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM
 myks.mytable WHERE thing = 'value';`, then do the math (put the ring
 together) and figure out which node has the data. I'm assuming this is what
 token aware drivers are doing.

 Is there a simpler way to do this?

 A bit more context: we'd like to move some processing closer to data,
 but for a few reasons hadoop/spark aren't good options for the moment.






-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-03 Thread Dan Kinder
Per Aleksey Yeschenko's comment on that ticket, it does seem like a
timestamp granularity issue, but it should work properly if it is within
the same session. gocql by default uses 2 connections and 128 streams per
connection. If you set it to 1 connection with 1 stream this problem goes
away. I suppose that'll take care of it in testing.

At least one interesting conclusion here: a gocql.Session does not map to
one Cassandra session. This makes some sense given that gocql says a
Session is meant to be shared concurrently (so it better not just be one
Cassandra session), but it is a bit concerning that there is no way to make
this 100% safe outside of cutting the gocql.Session down to 1 connection
and stream.
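
For anyone else hitting this in tests, a minimal sketch of that
configuration (NumStreams was a ClusterConfig field in the gocql of that
era; newer versions size streams themselves, so treat this as illustrative
rather than something to copy into production code):

package main

import "github.com/gocql/gocql"

func main() {
    cf := gocql.NewCluster("localhost")
    // Force every statement through a single connection and stream so a
    // read cannot race ahead of the write it follows.
    cf.NumConns = 1
    cf.NumStreams = 1

    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // ...run the read-after-write test queries against db as usual...
}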

On Mon, Mar 2, 2015 at 5:34 PM, Peter Sanford psanf...@retailnext.net
wrote:

 The more I think about it, the more this feels like a column timestamp
 issue. If two inserts have the same timestamp then the values are compared
 lexically to decide which one to keep (which I think explains the
 99/100 999/1000 mystery).

 We can verify this by also selecting out the WRITETIME of the column:

 ...
 var prevTS int64
 for i := 0; i < 10000; i++ {
     val := fmt.Sprintf("%d", i)
     db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()

     var result string
     var ts int64
     db.Query("SELECT val, WRITETIME(val) FROM ut.test WHERE key = 'foo'").Scan(&result, &ts)
     if result != val {
         fmt.Printf("Expected %v but got: %v; (prevTS:%d, ts:%d)\n", val, result, prevTS, ts)
     }
     prevTS = ts
 }


 When I run it with this change I see that the timestamps are in fact the
 same:

 Expected 10 but got: 9; (prevTS:1425345839903000, ts:1425345839903000)
 Expected 100 but got: 99; (prevTS:1425345839939000, ts:1425345839939000)
 Expected 101 but got: 99; (prevTS:1425345839939000, ts:1425345839939000)
 Expected 1000 but got: 999; (prevTS:1425345840296000, ts:1425345840296000)


 It looks like we're only getting millisecond precision instead of
 microsecond for the column timestamps?! If you explicitly set the timestamp
 value when you do the insert, you can get actual microsecond precision and
 the issue should go away.
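
For example, something along these lines could replace the plain UPDATE in
the loop above (the timestamp is formatted into the statement as a literal
so it doesn't depend on bind-marker support in the USING clause; it needs
an extra "time" import):

ts := time.Now().UnixNano() / 1000 // microseconds since the epoch
stmt := fmt.Sprintf("UPDATE ut.test USING TIMESTAMP %d SET val = ? WHERE key = 'foo'", ts)
if err := db.Query(stmt, val).Exec(); err != nil {
    panic(err)
}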

 -psanford

 On Mon, Mar 2, 2015 at 4:21 PM, Dan Kinder dkin...@turnitin.com wrote:

 Yeah I thought that was suspicious too, it's mysterious and fairly
 consistent. (By the way I had error checking but removed it for email
 brevity, but thanks for verifying :) )

 On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford psanf...@retailnext.net
 wrote:

 Hmm. I was able to reproduce the behavior with your go program on my dev
 machine (C* 2.0.12). I was hoping it was going to just be an unchecked
 error from the .Exec() or .Scan(), but that is not the case for me.

 The fact that the issue seems to happen on loop iteration 10, 100 and
 1000 is pretty suspicious. I took a tcpdump to confirm that gocql was
 in fact sending the "write 100" query and then on the next read Cassandra
 responded with 99.

 I'll be interested to see what the result of the jira ticket is.

 -psanford




 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com





-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
Nope, they flush every 5 to 10 minutes.

On Mon, Mar 2, 2015 at 1:13 PM, Daniel Chia danc...@coursera.org wrote:

 Do the tables look like they're being flushed every hour? It seems like
 the setting memtable_flush_after_mins which I believe defaults to 60
 could also affect how often your tables are flushed.

 Thanks,
 Daniel

 On Mon, Mar 2, 2015 at 11:49 AM, Dan Kinder dkin...@turnitin.com wrote:

 I see, thanks for the input. Compression is not enabled at the moment,
 but I may try increasing that number regardless.

 Also I don't think in-memory tables would work since the dataset is
 actually quite large. The pattern is more like a given set of rows will
 receive many overwriting updates and then not be touched for a while.

 On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com
 wrote:

 On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com
 wrote:

 Theoretically sstable_size_in_mb could be causing it to flush (it's at
 the default 160MB)... though we are flushing well before we hit 160MB. I
 have not tried changing this but we don't necessarily want all the sstables
 to be large anyway,


 I've always wished that the log message told you *why* the SSTable was
 being flushed, which of the various bounds prompted the flush.

 In your case, the size on disk may be under 160MB because compression is
 enabled. I would start by increasing that size.

 Datastax DSE has in-memory tables for this use case.

 =Rob




 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com





-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Yeah I thought that was suspicious too, it's mysterious and fairly
consistent. (By the way I had error checking but removed it for email
brevity, but thanks for verifying :) )

On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford psanf...@retailnext.net
wrote:

 Hmm. I was able to reproduce the behavior with your go program on my dev
 machine (C* 2.0.12). I was hoping it was going to just be an unchecked
 error from the .Exec() or .Scan(), but that is not the case for me.

 The fact that the issue seems to happen on loop iteration 10, 100 and 1000
 is pretty suspicious. I took a tcpdump to confirm that gocql was in
 fact sending the "write 100" query and then on the next read Cassandra
 responded with 99.

 I'll be interested to see what the result of the jira ticket is.

 -psanford




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Done: https://issues.apache.org/jira/browse/CASSANDRA-8892

On Mon, Mar 2, 2015 at 3:26 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Mar 2, 2015 at 11:44 AM, Dan Kinder dkin...@turnitin.com wrote:

 I had been having the same problem as in those older post:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E


 As I said on that thread :

 It sounds unreasonable/unexpected to me, if you have a trivial repro
 case, I would file a JIRA.

 =Rob




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
I see, thanks for the input. Compression is not enabled at the moment, but
I may try increasing that number regardless.

Also I don't think in-memory tables would work since the dataset is
actually quite large. The pattern is more like a given set of rows will
receive many overwriting updates and then not be touched for a while.

On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder dkin...@turnitin.com wrote:

 Theoretically sstable_size_in_mb could be causing it to flush (it's at
 the default 160MB)... though we are flushing well before we hit 160MB. I
 have not tried changing this but we don't necessarily want all the sstables
 to be large anyway,


 I've always wished that the log message told you *why* the SSTable was
 being flushed, which of the various bounds prompted the flush.

 In your case, the size on disk may be under 160MB because compression is
 enabled. I would start by increasing that size.

 Datastax DSE has in-memory tables for this use case.

 =Rob




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Hey all,

I had been having the same problem as in those older post:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E

To summarize it, on my local box with just one cassandra node I can update
and then select the updated row and get an incorrect response.

My understanding is this may have to do with not having fine-grained enough
timestamp resolution, but regardless I'm wondering: is this actually a bug
or is there any way to mitigate it? It causes sporadic failures in our unit
tests, and having to Sleep() between tests isn't ideal. At least confirming
it's a bug would be nice though.

For those interested, here's a little go program that can reproduce the
issue. When I run it I typically see:
Expected 100 but got: 99
Expected 1000 but got: 999

--- main.go: ---

package main

import (
    "fmt"

    "github.com/gocql/gocql"
)

func main() {
    cf := gocql.NewCluster("localhost")
    db, err := cf.CreateSession()
    if err != nil {
        panic(err.Error())
    }
    defer db.Close()

    // Keyspace ut = update test
    err = db.Query(`CREATE KEYSPACE IF NOT EXISTS ut
        WITH REPLICATION = {'class': 'SimpleStrategy',
        'replication_factor': 1 }`).Exec()
    if err != nil {
        panic(err.Error())
    }
    err = db.Query("CREATE TABLE IF NOT EXISTS ut.test (key text, val text, PRIMARY KEY(key))").Exec()
    if err != nil {
        panic(err.Error())
    }
    err = db.Query("TRUNCATE ut.test").Exec()
    if err != nil {
        panic(err.Error())
    }

    err = db.Query("INSERT INTO ut.test (key) VALUES ('foo')").Exec()
    if err != nil {
        panic(err.Error())
    }

    for i := 0; i < 10000; i++ {
        val := fmt.Sprintf("%d", i)
        db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()

        var result string
        db.Query("SELECT val FROM ut.test WHERE key = 'foo'").Scan(&result)
        if result != val {
            fmt.Printf("Expected %v but got: %v\n", val, result)
        }
    }
}


Less frequent flushing with LCS

2015-02-27 Thread Dan Kinder
Hi all,

We have a table in Cassandra where we frequently overwrite recent inserts.
Compaction does a fine job with this but ultimately larger memtables would
reduce compactions.

The question is: can we make Cassandra use larger memtables and flush less
frequently? What currently triggers the flushes? Opscenter shows them
flushing consistently at about 110MB in size, we have plenty of memory to
go larger.

According to
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_memtable_thruput_c.html
we can up the commit log space threshold, but this does not help, there is
plenty of runway there.

Theoretically sstable_size_in_mb could be causing it to flush (it's at the
default 160MB)... though we are flushing well before we hit 160MB. I have
not tried changing this but we don't necessarily want all the sstables to
be large anyway,

Thanks,
-dan


Re: large range read in Cassandra

2015-02-02 Thread Dan Kinder
For the benefit of others, I ended up finding out that the CQL library I
was using (https://github.com/gocql/gocql) at the time left the page size
defaulted to no paging, so Cassandra was trying to pull all rows of the
partition into memory at once. Setting the page size to a reasonable number
seems to have done the trick.
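
For other gocql users, a minimal sketch of both ways to set it (the table
and column names are made up, and ClusterConfig.PageSize may not exist in
older gocql versions, in which case set it per query):

package main

import "github.com/gocql/gocql"

func main() {
    cf := gocql.NewCluster("localhost")
    cf.PageSize = 1000 // default page size for every query on this session
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Or per query: the driver then fetches the partition in pages of 1000
    // rows behind the scenes while you iterate.
    iter := db.Query(
        "SELECT link FROM walker.links WHERE domain = ?", "example.com",
    ).PageSize(1000).Iter()
    var link string
    for iter.Scan(&link) {
        // dispatch link...
    }
    if err := iter.Close(); err != nil {
        panic(err)
    }
}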

On Tue, Nov 25, 2014 at 2:54 PM, Dan Kinder dkin...@turnitin.com wrote:

 Thanks, very helpful Rob, I'll watch for that.

 On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com
 wrote:

 On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com
 wrote:

 To be clear, I expect this range query to take a long time and perform
 relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
 https://issues.apache.org/jira/browse/CASSANDRA-4415,
 http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
 so that we aren't literally pulling the entire thing in. Am I
 misunderstanding this use case? Could you clarify why exactly it would slow
 way down? It seems like with each read it should be doing a simple range
 read from one or two sstables.


 If you're paging through a single partition, that's likely to be fine.
 When you said range reads ... over rows my impression was you were
 talking about attempting to page through millions of partitions.

 With that confusion cleared up, the likely explanation for lack of
 availability in your case is heap pressure/GC time. Look for GCs around
 that time. Also, if you're using authentication, make sure that your
 authentication keyspace has a replication factor greater than 1.

 =Rob





 --
 Dan Kinder
 Senior Software Engineer
 Turnitin – www.turnitin.com
 dkin...@turnitin.com




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: STCS limitation with JBOD?

2015-01-06 Thread Dan Kinder
Thanks for the info guys. Regardless of the reason for using nodetool
compact, it seems like the question still stands... but the impression I'm
getting is that nodetool compact on JBOD as I described will basically fall
apart. Is that correct?

To answer Colin's question as an aside: we have a dataset with fairly high
insert load and periodic range reads (batch processing). We have a
situation where we may want to rewrite some rows (changing the primary key) by
deleting and inserting as a new row. This is not something we would do on a
regular basis, but after or during the process a compact would greatly help
to clear out tombstones/rewritten data.

@Ryan Svihla it also sounds like your suggestion in this case would be:
create a new column family, rewrite all data into that, truncate/remove the
previous one, and replace it with the new one.
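
A rough gocql sketch of that copy-then-swap approach (the tables, columns,
and key-rewriting logic here are all hypothetical):

package main

import "github.com/gocql/gocql"

// rewriteKey stands in for whatever produces the new primary key.
func rewriteKey(old string) string { return "v2:" + old }

func main() {
    cf := gocql.NewCluster("localhost")
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Stream the old table in pages and write each row back under its new key.
    iter := db.Query("SELECT old_key, payload FROM ks.docs_old").PageSize(2000).Iter()
    var oldKey, payload string
    for iter.Scan(&oldKey, &payload) {
        if err := db.Query(
            "INSERT INTO ks.docs_new (new_key, payload) VALUES (?, ?)",
            rewriteKey(oldKey), payload,
        ).Exec(); err != nil {
            panic(err)
        }
    }
    if err := iter.Close(); err != nil {
        panic(err)
    }
    // After verifying the copy, DROP or TRUNCATE ks.docs_old and point readers
    // at ks.docs_new; no major compaction is needed to shed the old data.
}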

On Tue, Jan 6, 2015 at 9:39 AM, Ryan Svihla r...@foundev.pro wrote:

 nodetool compact is the ultimate running with scissors solution, far
 more people manage to stab themselves in the eye. Customers running with
 scissors successfully not withstanding.

 My favorite discussions usually tend to result:

1. We still have tombstones ( so they set gc_grace_seconds to 0)
2. We added a node after fixing it and now a bunch of records that
were deleted have come back (usually after setting gc_grace_seconds to 0
and then not blanking nodes that have been offline)
3. Why are my read latencies so spikey?  (cause they're on STC and now
have a giant single huge SStable which worked fine when their data set was
 tiny, now they're looking at 100 sstables on STC, which means slow
reads)
4. We still have tombstones (yeah I know this again, but this is
usually when they've switched to LCS, which basically noops with nodetool
compact)

 All of this is managed when you have a team that understands the tradeoffs
 of nodetool compact, but I categorically reject it's a good experience for
 new users, as I've unfortunately had about dozen fire drills this year as a
 result of nodetool compact alone.

 Data modeling around partitions that are truncated when falling out of
 scope is typically far more manageable, works with any compaction strategy,
 and doesn't require operational awareness at the same scale.

 On Fri, Jan 2, 2015 at 2:15 PM, Robert Coli rc...@eventbrite.com wrote:

 On Fri, Jan 2, 2015 at 11:28 AM, Colin co...@clark.ws wrote:

 Forcing a major compaction is usually a bad idea.  What is your reason
 for doing that?


 I'd say often and not usually. Lots of people have schema where they
 create way too much garbage, and major compaction can be a good response.
 The docs' historic incoherent FUD notwithstanding.

 =Rob





 --

 Thanks,
 Ryan Svihla




-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


STCS limitation with JBOD?

2015-01-02 Thread Dan Kinder
Hi,

Forcing a major compaction (using nodetool compact
http://datastax.com/documentation/cassandra/2.1/cassandra/tools/toolsCompact.html)
with STCS will result in a single sstable (ignoring repair data). However
this seems like it could be a problem for large JBOD setups. For example if
I have 12 disks, 1T each, then it seems like on this node I cannot have one
column family store more than 1T worth of data (more or less), because all
the data will end up in a single sstable that can exist only on one disk.
Is this accurate? The compaction write path docs
http://datastax.com/documentation/cassandra/2.1/cassandra/dml/dml_write_path_c.html
give a bit of hope that cassandra could split the one final sstable across
the disks, but I doubt it is able to and want to confirm.

I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in
JBOD mode could be solutions to this (with their own problems), but want to
verify that this actually is a problem.

-dan


Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks Rob.

To be clear, I expect this range query to take a long time and perform
relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
https://issues.apache.org/jira/browse/CASSANDRA-4415,
http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
so that we aren't literally pulling the entire thing in. Am I
misunderstanding this use case? Could you clarify why exactly it would slow
way down? It seems like with each read it should be doing a simple range
read from one or two sstables.

If this won't work then it may be we need to start using Hive/Spark/Pig
etc. sooner, or page it manually using LIMIT and WHERE > [the last returned
result].
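
A rough sketch of that manual variant within a single partition (the schema
and names are hypothetical):

package main

import "github.com/gocql/gocql"

func main() {
    cf := gocql.NewCluster("localhost")
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Hypothetical schema:
    //   CREATE TABLE walker.links (domain text, link text, PRIMARY KEY (domain, link))
    last := ""
    for {
        iter := db.Query(
            "SELECT link FROM walker.links WHERE domain = ? AND link > ? LIMIT 1000",
            "example.com", last,
        ).Iter()
        n := 0
        var link string
        for iter.Scan(&link) {
            // dispatch link...
            last = link
            n++
        }
        if err := iter.Close(); err != nil {
            panic(err)
        }
        if n < 1000 {
            break // reached the end of the partition
        }
    }
}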

On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli rc...@eventbrite.com wrote:

 On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder dkin...@turnitin.com wrote:

 We have a web crawler project currently based on Cassandra (
 https://github.com/iParadigms/walker, written in Go and using the gocql
 driver), with the following relevant usage pattern:

 - Big range reads over a CF to grab potentially millions of rows and
 dispatch new links to crawl


 If you really mean millions of storage rows, this is just about the worst
 case for Cassandra. The problem you're having is probably that you
 shouldn't try to do this in Cassandra.

 Your timeouts are either from the read actually taking longer than the
 timeout or from the reads provoking heap pressure and resulting GC.

 =Rob




Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks, very helpful Rob, I'll watch for that.

On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder dkin...@turnitin.com wrote:

 To be clear, I expect this range query to take a long time and perform
 relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
 https://issues.apache.org/jira/browse/CASSANDRA-4415,
 http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
 so that we aren't literally pulling the entire thing in. Am I
 misunderstanding this use case? Could you clarify why exactly it would slow
 way down? It seems like with each read it should be doing a simple range
 read from one or two sstables.


 If you're paging through a single partition, that's likely to be fine.
 When you said range reads ... over rows my impression was you were
 talking about attempting to page through millions of partitions.

 With that confusion cleared up, the likely explanation for lack of
 availability in your case is heap pressure/GC time. Look for GCs around
 that time. Also, if you're using authentication, make sure that your
 authentication keyspace has a replication factor greater than 1.

 =Rob





-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


large range read in Cassandra

2014-11-24 Thread Dan Kinder
Hi,

We have a web crawler project currently based on Cassandra (
https://github.com/iParadigms/walker, written in Go and using the gocql
driver), with the following relevant usage pattern:

- Big range reads over a CF to grab potentially millions of rows and
dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)

We ultimately planned on doing the batch processing step (the dispatching)
in a system like Spark, but for the time being it is also in Go. We believe
this should work fine given that Cassandra now properly allows chunked
iteration of columns in a CF.

The issue is, periodically while doing a particularly large range read,
other operations time out because that node is busy. In an experimental
cluster with only two nodes (and replication factor of 2), I'll get an
error like: Operation timed out - received only 1 responses. Indicating
that the second node took too long to reply. At the moment I have the long
range reads set to consistency level ANY but the rest of the operations are
on QUORUM, so on this cluster they require responses from both nodes. The
relevant CF is also using LeveledCompactionStrategy. This happens in both
Cassandra 2 and 2.1.

Despite this error I don't see any significant I/O, memory consumption, or
CPU usage.

Here are some of the configuration values I've played with:

Increasing timeouts:
read_request_timeout_in_ms: 15000
range_request_timeout_in_ms: 30000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 10000

Getting rid of caches we don't need:
key_cache_size_in_mb: 0
row_cache_size_in_mb: 0

Each of the 2 nodes has an HDD for the commit log and a single HDD I'm using
for data. Hence the following thread config (maybe since I/O is not an issue I
should increase these?):
concurrent_reads: 16
concurrent_writes: 32
concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O I've
increased this:
column_index_size_in_kb: 2048

It's something of a mystery why this error comes up. Of course with a 3rd
node it will get masked if I am doing QUORUM operations, but it still seems
like it should not happen, and that there is some kind of head-of-line
blocking or other issue in Cassandra. I would like to increase the amount
of dispatching I'm doing, but because of this it bogs down if I do.

Any suggestions for other things we can try here would be appreciated.

-dan