Re: storing indexes on ssd

2018-02-13 Thread Dan Kinder
On a single node that's a bit less than half full, the index files are 87G.

How will the OS disk cache know to keep the index file blocks cached but not
cache blocks from the data files? As far as I know it is not smart enough
to handle that gracefully.

Re: RAM expensiveness, see
https://www.extremetech.com/computing/263031-ram-prices-roof-stuck-way --
it's really not an important point though; RAM is still far more expensive
than disk, regardless of whether the price has been going up.

On Tue, Feb 13, 2018 at 12:02 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Feb 13, 2018 at 1:30 AM, Dan Kinder  wrote:
>
>> Created https://issues.apache.org/jira/browse/CASSANDRA-14229
>>
>
> This is confusing.  You've already started the conversation here...
>
> How big are your index files in the end?  Even if Cassandra doesn't cache
> them in or (off-) heap, they might as well just fit into the OS disk cache.
>
> From your ticket description:
> > ... as ram continues to get more expensive,..
>
> Where did you get that from?  I would expect quite the opposite.
>
> Regards,
> --
> Alex
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: storing indexes on ssd

2018-02-12 Thread Dan Kinder
Created https://issues.apache.org/jira/browse/CASSANDRA-14229

On Mon, Feb 12, 2018 at 12:10 AM, Mateusz Korniak <
mateusz-li...@ant.gliwice.pl> wrote:

> On Saturday 10 of February 2018 23:09:40 Dan Kinder wrote:
> > We're optimizing Cassandra right now for fairly random reads on a large
> > dataset. In this dataset, the values are much larger than the keys. I was
> > wondering, is it possible to have Cassandra write the *index* files
> > (*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db)
> to
> > another (HDD)? This would be an overall win for us since it's
> > cost-prohibitive to store the data itself all on SSD, but we hit the
> limits
> > if we just use HDD; effectively we would need to buy double, since we are
> > doing 2 random reads (index + data).
>
> Considered putting cassandra data on lvmcache?
> We are using this on small (3x2TB compressed data, 128/256MB cache)
> clusters
> since reaching I/O limits of 2xHDD in RAID10.
>
>
> --
> Mateusz Korniak
> "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
> krótko mówiąc - podpora społeczeństwa."
> Nikos Kazantzakis - "Grek Zorba"
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Setting min_index_interval to 1?

2018-02-12 Thread Dan Kinder
@Hannu this was based on the assumption that if we receive a read for a key
that is sampled, it'll be treated as cached and won't go to the index on
disk. Part of my question was whether that's the case; I'm not sure.

Btw I ended up giving up on this; trying the key cache route already showed
that it would require more memory than we have available. And even then,
the performance started to tank; we saw irqbalance and other processes peg
the CPU even without much load, so there was some NUMA-related problem
there that I don't have time to look into.

On Fri, Feb 2, 2018 at 12:42 AM, Hannu Kröger  wrote:

> Wouldn’t that still try to read the index on the disk? So you would just
> potentially have all keys on the memory and on the disk and reading would
> first happen in memory and then on the disk and only after that you would
> read the sstable.
>
> So you wouldn’t gain much, right?
>
> Hannu
>
> On 2 Feb 2018, at 02:25, Nate McCall  wrote:
>
>
>> Another was the crazy idea I started with of setting min_index_interval
>> to 1. My guess was that this would cause it to read all index entries, and
>> effectively have them all cached permanently. And it would read them
>> straight out of the SSTables on every restart. Would this work? Other than
>> probably causing a really long startup time, are there issues with this?
>>
>>
> I've never tried that. It sounds like you understand the potential impact
> on memory and startup time. If you have the data in such a way that you can
> easily experiment, I would like to see a breakdown of the impact on
> response time vs. memory usage as well as where the point of diminishing
> returns is on turning this down towards 1 (I think there will be a sweet
> spot somewhere).
>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


storing indexes on ssd

2018-02-10 Thread Dan Kinder
Hi,

We're optimizing Cassandra right now for fairly random reads on a large
dataset. In this dataset, the values are much larger than the keys. I was
wondering, is it possible to have Cassandra write the *index* files
(*-Index.db) to one drive (SSD), but write the *data* files (*-Data.db) to
another (HDD)? This would be an overall win for us since it's
cost-prohibitive to store the data itself all on SSD, but we hit the limits
if we just use HDD; effectively we would need to buy double, since we are
doing 2 random reads (index + data).

Thanks,
-dan


Setting min_index_interval to 1?

2018-02-01 Thread Dan Kinder
Hi, I have an unusual case here: I'm wondering what will happen if I
set min_index_interval to 1.

Here's the logic. Suppose I have a table where I really want to squeeze as
many reads/sec out of it as possible, and where the row data size is much
larger than the keys. E.g. the keys are a few bytes, the row data is ~500KB.

This table would be a great candidate for key caching. Let's suppose I have
enough memory to have every key cached. However, it's a lot of data, and
the reads are very random. So it would take a very long time for that cache
to warm up.

One solution is that I write a little app to go through every key to warm
it up manually, and ensure that Cassandra has key_cache_keys_to_save set to
save the whole thing on restart. (Anyone know of a better way of doing
this?)
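
For what it's worth, the warming app I have in mind would issue queries along
these lines (a rough sketch only; the keyspace, table, and key column names
below are placeholders):

-- 1) enumerate partition keys one token slice at a time:
SELECT key FROM my_keyspace.my_table WHERE token(key) > ? AND token(key) <= ?;
-- 2) then do an ordinary point read per key, so each partition's index
--    position gets pulled through the key cache:
SELECT key FROM my_keyspace.my_table WHERE key = ?;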

Another was the crazy idea I started with of setting min_index_interval to
1. My guess was that this would cause it to read all index entries, and
effectively have them all cached permanently. And it would read them
straight out of the SSTables on every restart. Would this work? Other than
probably causing a really long startup time, are there issues with this?
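
Concretely, I mean setting it per table, roughly like this (the table name is
a placeholder):

ALTER TABLE my_keyspace.my_table WITH min_index_interval = 1;
-- Note: I believe the summary can still be downsampled toward
-- max_index_interval if it outgrows index_summary_capacity_in_mb,
-- so those settings may matter too.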

Thanks,
-dan


LCS major compaction on 3.2+ on JBOD

2017-10-05 Thread Dan Kinder
Hi

I am wondering how major compaction behaves for a table using LCS on JBOD
with Cassandra 3.2+'s JBOD improvements.

Before that, I know that major compaction would use a single thread, include
all SSTables in a single compaction, and spit out a bunch of SSTables in
appropriate levels.

Does 3.2+ do 1 compaction per disk, since they are separate leveled
structures? Or does it do a single compaction task that writes SSTables to
the appropriate disk by key range?

-dan


Re:

2017-10-02 Thread Dan Kinder
Created https://issues.apache.org/jira/browse/CASSANDRA-13923

On Mon, Oct 2, 2017 at 12:06 PM, Dan Kinder  wrote:

> Sure will do.
>
> On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa  wrote:
>
>> You're right, sorry I didn't read the full stack (gmail hid it from me)
>>
>> Would you open a JIRA with your stack traces, and note (somewhat loudly)
>> that this is a regression?
>>
>>
>> On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder  wrote:
>>
>>> Right, I just meant that calling it at all results in holding a read
>>> lock, which unfortunately is blocking these read threads.
>>>
>>> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa  wrote:
>>>
>>>>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder 
>>>> wrote:
>>>>
>>>>> (As a side note, it seems silly to call shouldDefragment at all on a
>>>>> read if the compaction strategy is not STCS)
>>>>>
>>>>>
>>>>>
>>>> It defaults to false:
>>>>
>>>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/j
>>>> ava/org/apache/cassandra/db/compaction/AbstractCompactionStr
>>>> ategy.java#L302
>>>>
>>>> And nothing else other than STCS overrides it to true.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Dan Kinder
>>> Principal Software Engineer
>>> Turnitin – www.turnitin.com
>>> dkin...@turnitin.com
>>>
>>
>>
>
>
> --
> Dan Kinder
> Principal Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-10-02 Thread Dan Kinder
Sure will do.

On Mon, Oct 2, 2017 at 11:48 AM, Jeff Jirsa  wrote:

> You're right, sorry I didn't read the full stack (gmail hid it from me)
>
> Would you open a JIRA with your stack traces, and note (somewhat loudly)
> that this is a regression?
>
>
> On Mon, Oct 2, 2017 at 11:43 AM, Dan Kinder  wrote:
>
>> Right, I just meant that calling it at all results in holding a read
>> lock, which unfortunately is blocking these read threads.
>>
>> On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa  wrote:
>>
>>>
>>>
>>> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder 
>>> wrote:
>>>
>>>> (As a side note, it seems silly to call shouldDefragment at all on a
>>>> read if the compaction strategy is not STCS)
>>>>
>>>>
>>>>
>>> It defaults to false:
>>>
>>> https://github.com/apache/cassandra/blob/cassandra-3.0/src/j
>>> ava/org/apache/cassandra/db/compaction/AbstractCompactionStr
>>> ategy.java#L302
>>>
>>> And nothing else other than STCS overrides it to true.
>>>
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Principal Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-10-02 Thread Dan Kinder
Right, I just meant that calling it at all results in holding a read lock,
which unfortunately is blocking these read threads.

On Mon, Oct 2, 2017 at 11:40 AM, Jeff Jirsa  wrote:

>
>
> On Mon, Oct 2, 2017 at 11:27 AM, Dan Kinder  wrote:
>
>> (As a side note, it seems silly to call shouldDefragment at all on a read
>> if the compaction strategy is not STCS)
>>
>>
>>
> It defaults to false:
>
> https://github.com/apache/cassandra/blob/cassandra-3.0/
> src/java/org/apache/cassandra/db/compaction/AbstractCompactionStrategy.
> java#L302
>
> And nothing else other than STCS overrides it to true.
>
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re:

2017-09-28 Thread Dan Kinder
Sorry, I take back what I said about that ReadStage exception; I had
accidentally ended up too early in the logs. The node whose ReadStage is
building up shows no exceptions in the logs.

nodetool tpstats
Pool Name Active   Pending  Completed   Blocked
 All time blocked
ReadStage  8  1882  45881 0
0
MiscStage  0 0  0 0
0
CompactionExecutor 9 9   2551 0
0
MutationStage  0 0   35929880 0
0
GossipStage0 0  35793 0
0
RequestResponseStage   0 0 751285 0
0
ReadRepairStage0 0224 0
0
CounterMutationStage   0 0  0 0
0
MemtableFlushWriter0 0111 0
0
MemtablePostFlush  0 0239 0
0
ValidationExecutor 0 0  0 0
0
ViewMutationStage  0 0  0 0
0
CacheCleanupExecutor   0 0  0 0
0
PerDiskMemtableFlushWriter_10  0 0104 0
0
PerDiskMemtableFlushWriter_11  0 0104 0
0
MemtableReclaimMemory  0 0116 0
0
PendingRangeCalculator 0 0 16 0
0
SecondaryIndexManagement   0 0  0 0
0
HintsDispatcher0 0 13 0
0
PerDiskMemtableFlushWriter_1   0 0104 0
0
Native-Transport-Requests  0 02607030 0
0
PerDiskMemtableFlushWriter_2   0 0104 0
0
MigrationStage 0 0278 0
0
PerDiskMemtableFlushWriter_0   0 0115 0
0
Sampler0 0  0 0
0
PerDiskMemtableFlushWriter_5   0 0104 0
0
InternalResponseStage  0 0298 0
0
PerDiskMemtableFlushWriter_6   0 0104 0
0
PerDiskMemtableFlushWriter_3   0 0104 0
0
PerDiskMemtableFlushWriter_4   0 0104 0
0
PerDiskMemtableFlushWriter_9   0 0104 0
0
AntiEntropyStage   0 0  0 0
0
PerDiskMemtableFlushWriter_7   0 0104 0
0
PerDiskMemtableFlushWriter_8   0 0104 0
0

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
HINT 0
MUTATION 0
COUNTER_MUTATION 0
BATCH_STORE  0
BATCH_REMOVE 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR  0


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder  wrote:

> Thanks for the responses.
>
> @Prem yes this is after the entire cluster is on 3.11, but no I did not
> run upgradesstables yet.
>
> @Thomas no I don't see any major GC going on.
>
> @Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
> bring it back (thankfully this cluster is not serving live traffic). The
> nodes seemed okay for an hour or two, but I see the issue again, without me
> bouncing any nodes. This time it's ReadStage that's building up, and the
> exception I'm seeing in the logs is:
>
> DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242
> - Digest mismatch:
>
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(6150926370328526396, 696a6374652e6f7267) (
> 2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
>
> at 
> org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 

Re:

2017-09-28 Thread Dan Kinder
Thanks for the responses.

@Prem yes this is after the entire cluster is on 3.11, but no I did not run
upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah it's fully upgraded. I decided to shut the whole thing down and
bring it back (thankfully this cluster is not serving live traffic). The
nodes seemed okay for an hour or two, but I see the issue again, without me
bouncing any nodes. This time it's ReadStage that's building up, and the
exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 -
Digest mismatch:

org.apache.cassandra.service.DigestMismatchException: Mismatch for key
DecoratedKey(6150926370328526396, 696a6374652e6f7267)
(2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)

at
org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[na:1.8.0_71]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[na:1.8.0_71]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]


Do you think running upgradesstables would help? Or relocatesstables? I
presumed it shouldn't be necessary for Cassandra to function, just an
optimization.

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Dan,
>
>
>
> do you see any major GC? We have been hit by the following memory leak in
> our loadtest environment with 3.11.0.
>
> https://issues.apache.org/jira/browse/CASSANDRA-13754
>
>
>
> So, depending on the heap size and uptime, you might get into heap
> troubles.
>
>
>
> Thomas
>
>
>
> *From:* Dan Kinder [mailto:dkin...@turnitin.com]
> *Sent:* Donnerstag, 28. September 2017 18:20
> *To:* user@cassandra.apache.org
> *Subject:*
>
>
>
> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and rest

Re:

2017-09-28 Thread Dan Kinder
I should also note that I see nodes become locked up without seeing that
exception. But the GossipStage buildup does seem correlated with gossip
activity, e.g. me restarting a different node.

On Thu, Sep 28, 2017 at 9:20 AM, Dan Kinder  wrote:

> Hi,
>
> I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
> following. The cluster does function, for a while, but then some stages
> begin to back up and the node does not recover and does not drain the
> tasks, even under no load. This happens both to MutationStage and
> GossipStage.
>
> I do see the following exception happen in the logs:
>
>
> ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
> CassandraDaemon.java:228 - Exception in thread
> Thread[ReadRepairStage:2328,5,main]
>
> org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out
> - received only 1 responses.
>
> at org.apache.cassandra.service.DataResolver$
> RepairMergeListener.close(DataResolver.java:171)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at org.apache.cassandra.db.partitions.
> UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_91]
>
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ~[na:1.8.0_91]
>
> at org.apache.cassandra.concurrent.NamedThreadFactory.
> lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> ~[apache-cassandra-3.11.0.jar:3.11.0]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]
>
>
> But it's hard to correlate precisely with things going bad. It is also
> very strange to me since I have both read_repair_chance and
> dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
> confusing why ReadRepairStage would err.
>
> Anyone have thoughts on this? It's pretty muddling, and causes nodes to
> lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
> If I can't find a resolution I'm going to need to downgrade and restore to
> backup...
>
> The only issue I found that looked similar is https://issues.apache.org/
> jira/browse/CASSANDRA-12689 but that appears to be fixed by 3.10.
>
>
> $ nodetool tpstats
>
> Pool Name Active   Pending  Completed
> Blocked  All time blocked
>
> ReadStage  0 0 582103 0
> 0
>
> MiscStage  0 0  0 0
> 0
>
> CompactionExecutor1111   2868 0
> 0
>
> MutationStage 32   4593678   55057393 0
> 0
>
> GossipStage1  2818 371487 0
> 0
>
> RequestResponseStage   0 04345522 0
> 0
>
> ReadRepairStage0 0 151473 0
> 0
>
> CounterMutationStage   0 0  0 0
> 0
>
> MemtableFlushWriter181 76 0
> 0
>
> MemtablePostFlush  1   382139 0
> 0
>
> ValidationExecutor 0 0  0 0
> 0
>
> ViewMutationStage  0 0  0 0
> 0
>
> CacheCleanupExecutor   0 0  0 0
> 0
>
> PerDiskMemtableFlushWriter_10  0 0 69 0
> 0
>
> PerDiskMemtableFlushWriter_11  0 0 69 0
> 0
>
> MemtableReclaimMemory  0 0 81 0
> 0
>
> PendingRangeCalc

[no subject]

2017-09-28 Thread Dan Kinder
Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and see the
following. The cluster does function, for a while, but then some stages
begin to back up and the node does not recover and does not drain the
tasks, even under no load. This happens both to MutationStage and
GossipStage.

I do see the following exception happen in the logs:


ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440
CassandraDaemon.java:228 - Exception in thread
Thread[ReadRepairStage:2328,5,main]

org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out -
received only 1 responses.

at
org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
~[apache-cassandra-3.11.0.jar:3.11.0]

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
~[na:1.8.0_91]

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
~[na:1.8.0_91]

at
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
~[apache-cassandra-3.11.0.jar:3.11.0]

at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]


But it's hard to correlate precisely with things going bad. It is also very
strange to me since I have both read_repair_chance and
dclocal_read_repair_chance set to 0.0 for ALL of my tables. So it is
confusing why ReadRepairStage would err.
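
(For reference, the tables were set up roughly like this; the table name here
is just a placeholder:)

ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.0;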

Anyone have thoughts on this? It's pretty muddling, and causes nodes to
lock up. Once it happens Cassandra can't even shut down, I have to kill -9.
If I can't find a resolution I'm going to need to downgrade and restore to
backup...

The only issue I found that looked similar is
https://issues.apache.org/jira/browse/CASSANDRA-12689 but that appears to
be fixed by 3.10.


$ nodetool tpstats

Pool Name Active   Pending  Completed   Blocked
All time blocked

ReadStage  0 0 582103 0
  0

MiscStage  0 0  0 0
  0

CompactionExecutor1111   2868 0
  0

MutationStage 32   4593678   55057393 0
  0

GossipStage1  2818 371487 0
  0

RequestResponseStage   0 04345522 0
  0

ReadRepairStage0 0 151473 0
  0

CounterMutationStage   0 0  0 0
  0

MemtableFlushWriter181 76 0
  0

MemtablePostFlush  1   382139 0
  0

ValidationExecutor 0 0  0 0
  0

ViewMutationStage  0 0  0 0
  0

CacheCleanupExecutor   0 0  0 0
  0

PerDiskMemtableFlushWriter_10  0 0 69 0
  0

PerDiskMemtableFlushWriter_11  0 0 69 0
  0

MemtableReclaimMemory  0 0 81 0
  0

PendingRangeCalculator 0 0 32 0
  0

SecondaryIndexManagement   0 0  0 0
  0

HintsDispatcher0 0596 0
  0

PerDiskMemtableFlushWriter_1   0 0 69 0
  0

Native-Transport-Requests 11 04547746 0
  67

PerDiskMemtableFlushWriter_2   0 0 69 0
  0

MigrationStage 1  1545586 0
  0

PerDiskMemtableFlushWriter_0   0 0 80 0
  0

Sampler0 0  0 0
  0

PerDiskMemtableFlushWriter_5   0 0 69 0
  

Re: Problems with large partitions and compaction

2017-02-15 Thread Dan Kinder
What Cassandra version? CMS or G1? What are your timeouts set to?

"GC activity"  - Even if there isn't a lot of activity per se maybe there
is a single long pause happening. I have seen large partitions cause lots
of allocation fast.

Looking at SSTable levels in nodetool cfstats can help; look at it for all
of your tables.

Don't recommend switching to STCS until you know more. You end up with
massive compaction that takes a long time to settle down.

On Tue, Feb 14, 2017 at 5:50 PM, John Sanda  wrote:

> I have a table that uses LCS and has wound up with partitions upwards of
> 700 MB. I am seeing lots of the large partition warnings. Client requests
> are subsequently failing. The driver is not reporting timeout exception,
> just NoHostAvailableExceptions (in the logs I have reviewed so far). I know
> that I need to redesign the table to avoid such large partitions. What
> specifically goes wrong that results in the instability I am seeing? Or put
> another way, what issues will compacting really large partitions cause?
> Initially I thought that there was high GC activity, but after closer
> inspection that does not really seem to happening. And most of the failures
> I am seeing are on reads, but for an entirely different table. Lastly, does
> anyone has anyone had success to switching to STCS in this situation as a
> work around?
>
> Thanks
>
> - John
>



-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Cassandra Golang Driver and Support

2016-04-14 Thread Dan Kinder
Just want to put a plug in for gocql and the guys who work on it. I use it
for production applications that sustain ~10,000 writes/sec on an 8 node
cluster and in the few times I have seen problems they have been responsive
on issues and pull requests. Once or twice I have seen the API change but
otherwise it has been stable. In general I have found it very intuitive to
use and easy to configure.

On Thu, Apr 14, 2016 at 2:30 PM, Yawei Li  wrote:

> Thanks for the info, Bryan!
> We are in general assessing the support level of GoCQL vs. the Java Driver. From
> http://gocql.github.io/, looks like it is a WIP (some TODO items, api is
> subject to change)? And https://github.com/gocql/gocql suggests the
> performance may degrade now and then, and the supported versions are up to
> 2.2.x? For us maintaining two stacks (Java and Go) may be expensive so I am
> checking what's the general strategy folks are using here.
>
> On Wed, Apr 13, 2016 at 11:31 AM, Bryan Cheng 
> wrote:
>
>> Hi Yawei,
>>
>> While you're right that there's no first-party driver, we've had good
>> luck using gocql (https://github.com/gocql/gocql) in production at
>> moderate scale. What features in particular are you looking for that are
>> missing?
>>
>> --Bryan
>>
>> On Tue, Apr 12, 2016 at 10:06 PM, Yawei Li  wrote:
>>
>>> Hi,
>>>
>>> It looks like to me that DataStax doesn't provide official golang driver
>>> yet and the goland client libs are overall lagging behind the Java driver
>>> in terms of feature set, supported version and possibly production
>>> stability?
>>>
>>> We are going to support a large number of services in both Java and Go.
>>> If the above impression is largely true, we are considering the option of
>>> focusing on the Java client and having the GoLang program talk to the Java
>>> service via RPC for data access. Has anyone tried a similar approach?
>>>
>>> Thanks
>>>
>>
>>


Re: MemtableReclaimMemory pending building up

2016-03-08 Thread Dan Kinder
Quick follow-up here: so far I've had these nodes stable for about 2 days
now with the following (still mysterious) solution: *increase*
memtable_heap_space_in_mb to 20GB. It was having issues at the default value
of 1/4 of the heap (12GB in my case; I misspoke earlier and said 16GB).
Upping it to 20GB seems to have made the issue go away so far.

Best guess now is that it simply was memtable flush throughput. Playing
with memtable_cleanup_threshold further may have also helped but I didn't
want to create small SSTables.

Thanks again for the input @Alain.

On Fri, Mar 4, 2016 at 4:53 PM, Dan Kinder  wrote:

> Hi thanks for responding Alain. Going to provide more info inline.
>
> However, a small update that is probably relevant: while the node was in
> this state (MemtableReclaimMemory building up), since this cluster is not
> serving live traffic I temporarily turned off ALL client traffic, and the
> node still never recovered; MemtableReclaimMemory never went down. It seems
> like there is one thread doing this reclaiming and it has gotten stuck
> somehow.
>
> Will let you know when I have more results from experimenting... but
> again, merci
>
> On Thu, Mar 3, 2016 at 2:32 AM, Alain RODRIGUEZ 
> wrote:
>
>> Hi Dan,
>>
>> I'll try to go through all the elements:
>>
>> seeing this odd behavior happen, seemingly to single nodes at a time
>>
>>
>> Is that one node at the time or always on the same node. Do you consider
>> your data model if fairly, evenly distributed ?
>>
>
> of 6 nodes, 2 of them seem to be the recurring culprits. Could be related
> to a particular data partition.
>
>
>>
>> The node starts to take more and more memory (instance has 48GB memory on
>>> G1GC)
>>
>>
>> Do you use 48 GB heap size or is that the total amount of memory in the
>> node ? Could we have your JVM settings (GC and heap sizes), also memtable
>> size and type (off heap?) and the amount of available memory ?
>>
>
> Machine spec: 24 virtual cores, 64GB memory, 12 HDD JBOD (yes an absurd
> number of disks, not my choice)
>
> memtable_heap_space_in_mb: 10240 # 10GB (previously left as default which
> was 16GB and caused the issue more frequently)
> memtable_allocation_type: heap_buffers
> memtable_flush_writers: 12
>
> MAX_HEAP_SIZE="48G"
> JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
> JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
> JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
> JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
> JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
>
>>
>> Note that there is a decent number of compactions going on as well but
>>> that is expected on these nodes and this particular one is catching up from
>>> a high volume of writes
>>>
>>
>> Are the *concurrent_compactors* correctly throttled (about 8 with good
>> machines) and the *compaction_throughput_mb_per_sec* high enough to cope
>> with what is thrown at the node ? Using SSD I often see the latter
>> unthrottled (using 0 value), but I would try small increments first.
>>
> concurrent_compactors: 12
> compaction_throughput_mb_per_sec: 0
>
>>
>> Also interestingly, neither CPU nor disk utilization are pegged while
>>> this is going on
>>>
>>
>> First thing is making sure your memory management is fine. Having
>> information about the JVM and memory usage globally would help. Then, if
>> you are not fully using the resources you might want to try increasing the
>> number of *concurrent_writes* to a higher value (probably a way higher,
>> given the pending requests, but go safely, incrementally, first on a canary
>> node) and monitor tpstats + resources. Hope this will help Mutation pending
>> going down. My guess is that pending requests are messing with the JVM, but
>> it could be the exact contrary as well.
>>
> concurrent_writes: 192
> It may be worth noting that the main reads going on are large batch reads,
> while these writes are happening (akin to analytics jobs).
>
> I'm going to look into JVM use a bit more but otherwise it seems like
> normal Young generation GCs are happening even as this problem surfaces.
>
>
>>
>> Native-Transport-Requests25 0  547935519 0
>>> 2586907
>>
>>
>> About Native requests being blocked, you can probably mitigate things by
>> increasing the native_transport_max_threads: 128 (try to double it and
>> continue tuning incrementally). Also, an up to date client, using 

Re: MemtableReclaimMemory pending building up

2016-03-04 Thread Dan Kinder
ions = high memory pressure.
> Reducing pending stuff somehow will probably get you out of trouble.
>
> Hope this first round of ideas will help you.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 22:58 GMT+01:00 Dan Kinder :
>
>> Also should note: Cassandra 2.2.5, Centos 6.7
>>
>> On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder  wrote:
>>
>>> Hi y'all,
>>>
>>> I am writing to a cluster fairly fast and seeing this odd behavior
>>> happen, seemingly to single nodes at a time. The node starts to take more
>>> and more memory (instance has 48GB memory on G1GC). tpstats shows that
>>> MemtableReclaimMemory Pending starts to grow first, then later
>>> MutationStage builds up as well. By then most of the memory is being
>>> consumed, GC is getting longer, node slows down and everything slows down
>>> unless I kill the node. Also the number of Active MemtableReclaimMemory
>>> threads seems to stay at 1. Also interestingly, neither CPU nor disk
>>> utilization are pegged while this is going on; it's on jbod and there is
>>> plenty of headroom there. (Note that there is a decent number of
>>> compactions going on as well but that is expected on these nodes and this
>>> particular one is catching up from a high volume of writes).
>>>
>>> Anyone have any theories on why this would be happening?
>>>
>>>
>>> $ nodetool tpstats
>>> Pool NameActive   Pending  Completed   Blocked
>>>  All time blocked
>>> MutationStage   192715481  311327142 0
>>>   0
>>> ReadStage 7 09142871 0
>>>   0
>>> RequestResponseStage  1 0  690823199 0
>>>   0
>>> ReadRepairStage   0 02145627 0
>>>   0
>>> CounterMutationStage  0 0  0 0
>>>   0
>>> HintedHandoff 0 0144 0
>>>   0
>>> MiscStage 0 0  0 0
>>>   0
>>> CompactionExecutor   1224  41022 0
>>>   0
>>> MemtableReclaimMemory 1   102   4263 0
>>>   0
>>> PendingRangeCalculator0 0 10 0
>>>   0
>>> GossipStage   0 0 148329 0
>>>   0
>>> MigrationStage0 0  0 0
>>>   0
>>> MemtablePostFlush 0 0   5233 0
>>>   0
>>> ValidationExecutor0 0  0 0
>>>   0
>>> Sampler       0     0  0 0
>>>   0
>>> MemtableFlushWriter   0 0   4270 0
>>>   0
>>> InternalResponseStage 0 0   16322698 0
>>>   0
>>> AntiEntropyStage  0 0  0 0
>>>   0
>>> CacheCleanupExecutor  0 0  0 0
>>>   0
>>> Native-Transport-Requests25 0  547935519 0
>>> 2586907
>>>
>>> Message type   Dropped
>>> READ 0
>>> RANGE_SLICE  0
>>> _TRACE   0
>>> MUTATION287057
>>> COUNTER_MUTATION 0
>>> REQUEST_RESPONSE 0
>>> PAGED_RANGE  0
>>> READ_REPAIR149
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Principal Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>


Re: MemtableReclaimMemory pending building up

2016-03-02 Thread Dan Kinder
Also should note: Cassandra 2.2.5, Centos 6.7

On Wed, Mar 2, 2016 at 1:34 PM, Dan Kinder  wrote:

> Hi y'all,
>
> I am writing to a cluster fairly fast and seeing this odd behavior happen,
> seemingly to single nodes at a time. The node starts to take more and more
> memory (instance has 48GB memory on G1GC). tpstats shows that
> MemtableReclaimMemory Pending starts to grow first, then later
> MutationStage builds up as well. By then most of the memory is being
> consumed, GC is getting longer, node slows down and everything slows down
> unless I kill the node. Also the number of Active MemtableReclaimMemory
> threads seems to stay at 1. Also interestingly, neither CPU nor disk
> utilization are pegged while this is going on; it's on jbod and there is
> plenty of headroom there. (Note that there is a decent number of
> compactions going on as well but that is expected on these nodes and this
> particular one is catching up from a high volume of writes).
>
> Anyone have any theories on why this would be happening?
>
>
> $ nodetool tpstats
> Pool NameActive   Pending  Completed   Blocked
>  All time blocked
> MutationStage   192715481  311327142 0
> 0
> ReadStage 7 09142871 0
> 0
> RequestResponseStage  1 0  690823199 0
> 0
> ReadRepairStage   0 02145627 0
> 0
> CounterMutationStage  0 0  0 0
> 0
> HintedHandoff 0 0144 0
> 0
> MiscStage 0 0  0 0
> 0
> CompactionExecutor   1224  41022 0
> 0
> MemtableReclaimMemory 1   102   4263 0
> 0
> PendingRangeCalculator0 0 10 0
> 0
> GossipStage   0 0 148329 0
> 0
> MigrationStage0 0  0 0
> 0
> MemtablePostFlush 0 0   5233 0
> 0
> ValidationExecutor0 0  0 0
> 0
> Sampler   0 0  0 0
> 0
> MemtableFlushWriter   0 0   4270 0
> 0
> InternalResponseStage 0 0   16322698 0
> 0
> AntiEntropyStage  0 0  0 0
> 0
> CacheCleanupExecutor  0 0  0 0
> 0
> Native-Transport-Requests25 0  547935519 0
>   2586907
>
> Message type   Dropped
> READ 0
> RANGE_SLICE  0
> _TRACE   0
> MUTATION287057
> COUNTER_MUTATION 0
> REQUEST_RESPONSE 0
> PAGED_RANGE  0
> READ_REPAIR149
>
>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


MemtableReclaimMemory pending building up

2016-03-02 Thread Dan Kinder
Hi y'all,

I am writing to a cluster fairly fast and seeing this odd behavior happen,
seemingly to single nodes at a time. The node starts to take more and more
memory (instance has 48GB memory on G1GC). tpstats shows that
MemtableReclaimMemory Pending starts to grow first, then later
MutationStage builds up as well. By then most of the memory is being
consumed, GC is getting longer, node slows down and everything slows down
unless I kill the node. Also the number of Active MemtableReclaimMemory
threads seems to stay at 1. Also interestingly, neither CPU nor disk
utilization are pegged while this is going on; it's on jbod and there is
plenty of headroom there. (Note that there is a decent number of
compactions going on as well but that is expected on these nodes and this
particular one is catching up from a high volume of writes).

Anyone have any theories on why this would be happening?


$ nodetool tpstats
Pool NameActive   Pending  Completed   Blocked  All
time blocked
MutationStage   192715481  311327142 0
0
ReadStage 7 09142871 0
0
RequestResponseStage  1 0  690823199 0
0
ReadRepairStage   0 02145627 0
0
CounterMutationStage  0 0  0 0
0
HintedHandoff 0 0144 0
0
MiscStage 0 0  0 0
0
CompactionExecutor   1224  41022 0
0
MemtableReclaimMemory 1   102   4263 0
0
PendingRangeCalculator0 0 10 0
0
GossipStage   0 0 148329 0
0
MigrationStage0 0  0 0
0
MemtablePostFlush 0 0   5233 0
0
ValidationExecutor0 0  0 0
0
Sampler   0 0  0 0
0
MemtableFlushWriter   0 0   4270 0
0
InternalResponseStage 0 0   16322698 0
0
AntiEntropyStage  0 0  0 0
0
CacheCleanupExecutor  0 0  0 0
0
Native-Transport-Requests25 0  547935519 0
  2586907

Message type   Dropped
READ 0
RANGE_SLICE  0
_TRACE   0
MUTATION287057
COUNTER_MUTATION 0
REQUEST_RESPONSE 0
PAGED_RANGE  0
READ_REPAIR149


Re: Production with Single Node

2016-01-22 Thread Dan Kinder
I could see this being desirable if you are deploying the exact same
application as you deploy elsewhere with many nodes, and you know the load
will be low. It may be a rare situation, but in such a case you save
significant effort by not having to change your application logic.

Not that I necessarily recommend it, but to answer John's question: my
understanding is that if you want to keep it snappy and low-latency you
should watch out for GC pauses and consider your GC tuning carefully, since
with a single node a long pause will stop the whole show. Presumably your
load won't be very high.

Also if you are concerned with durability you may want to consider changing
commitlog_sync
<https://docs.datastax.com/en/cassandra/1.2/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__commitlog_sync>
to
batch. I believe this is the only way to guarantee write durability with
one node. Again with the performance caveat; under high load it could cause
problems.

On Fri, Jan 22, 2016 at 12:34 PM, Jonathan Haddad  wrote:

> My opinion:
> http://rustyrazorblade.com/2013/09/cassandra-faq-can-i-start-with-a-single-node/
>
> TL;DR: the only reason to run 1 node in prod is if you're super broke but
> know you'll need to scale up almost immediately after going to prod (maybe
> after getting some funding).
>
> If you're planning on doing it as a more permanent solution, you've chosen
> the wrong database.
>
> On Fri, Jan 22, 2016 at 12:30 PM Jack Krupansky 
> wrote:
>
>> The risks would be about the same as with a single-node Postgres or MySQL
>> database, except that you wouldn't have the benefit of full SQL.
>>
>> How much data (rows, columns), what kind of load pattern (heavy write,
>> heavy update, heavy query), and what types of queries (primary key-only,
>> slices, filtering, secondary indexes, etc.)?
>>
>> -- Jack Krupansky
>>
>> On Fri, Jan 22, 2016 at 3:24 PM, John Lammers <
>> john.lamm...@karoshealth.com> wrote:
>>
>>> After deploying a number of production systems with up to 10 Cassandra
>>> nodes each, we are looking at deploying a small, all-in-one-server system
>>> with only a single, local node (Cassandra 2.1.11).
>>>
>>> What are the risks of such a configuration?
>>>
>>> The virtual disk would be running RAID 5 and the disk controller would
>>> have a flash backed write-behind cache.
>>>
>>> What's the best way to configure Cassandra and/or respecify the hardware
>>> for an all-in-one-box solution?
>>>
>>> Thanks-in-advance!
>>>
>>> --John
>>>
>>>
>>


-- 
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: compression cpu overhead

2015-11-04 Thread Dan Kinder
To clarify, writes have no *immediate* CPU cost from adding the write to
the memtable; however, the compression overhead cost is paid when writing
out a new SSTable (whether from flushing a memtable or compacting), correct?

So it sounds like when reads >> writes then Tushar's comments are accurate,
but for a high write workload flushing and compactions would create most of
the overhead.
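
If anyone wants to measure this, a simple way to A/B it on a test table is to
toggle the table's compression (a sketch using the 2.x syntax; the
keyspace/table names are placeholders, and existing SSTables keep their old
setting until they are rewritten):

ALTER TABLE my_keyspace.my_table
  WITH compression = {'sstable_compression': 'LZ4Compressor', 'chunk_length_kb': 64};
-- ...versus the uncompressed variant for comparison:
ALTER TABLE my_keyspace.my_table
  WITH compression = {'sstable_compression': ''};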

On Tue, Nov 3, 2015 at 6:03 PM, Jon Haddad 
wrote:

> You won't see any overhead on writes because you don't actually write to
> sstables when performing a write.  Just the commit log & memtable.
> Memtables are flushed asynchronously.
>
> On Nov 4, 2015, at 1:57 AM, Tushar Agrawal 
> wrote:
>
> For writes it's negligible. For reads it makes a significant difference
> for high tps and low latency workload. You would see up to 3x higher cpu
> with LZ4 vs no compression. It would be different for different h/w
> configurations.
>
>
> Thanks,
> Tushar
> (Sent from iPhone)
>
> On Nov 3, 2015, at 5:51 PM, Dan Kinder  wrote:
>
> Most concerned about write since that's where most of the cost is, but
> perf numbers for any workload mix would be helpful.
>
> On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson  wrote:
>
>> On read or write?
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2
>> should make some difference, I didn’t immediately find perf numbers though.
>>
>> On Nov 3, 2015, at 5:42 PM, Dan Kinder  wrote:
>>
>> Hey all,
>>
>> Just wondering if anyone has seen or done any benchmarking for the
>> actual CPU overhead added by various compression algorithms in Cassandra
>> (at least LZ4) vs no compression. Clearly this is going to be workload
>> dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
>> compression increases my CPU load by ~2x")
>>
>> -dan
>>
>>
>>
>
>
> --
> Dan Kinder
> Senior Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: compression cpu overhead

2015-11-03 Thread Dan Kinder
Most concerned about write since that's where most of the cost is, but perf
numbers for any workload mix would be helpful.

On Tue, Nov 3, 2015 at 3:48 PM, Graham Sanderson  wrote:

> On read or write?
>
> https://issues.apache.org/jira/browse/CASSANDRA-7039 and friends in 2.2
> should make some difference, I didn’t immediately find perf numbers though.
>
> On Nov 3, 2015, at 5:42 PM, Dan Kinder  wrote:
>
> Hey all,
>
> Just wondering if anyone has seen or done any benchmarking for the
> actual CPU overhead added by various compression algorithms in Cassandra
> (at least LZ4) vs no compression. Clearly this is going to be workload
> dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
> compression increases my CPU load by ~2x")
>
> -dan
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


compression cpu overhead

2015-11-03 Thread Dan Kinder
Hey all,

Just wondering if anyone has seen or done any benchmarking for the
actual CPU overhead added by various compression algorithms in Cassandra
(at least LZ4) vs no compression. Clearly this is going to be workload
dependent but even a rough gauge would be helpful (ex. "Turning on LZ4
compression increases my CPU load by ~2x")

-dan


Re: memtable flush size with LCS

2015-11-02 Thread Dan Kinder
@Jeff Jirsa thanks, the memtable_* keys were the actual determining factor
for my memtable flushes; they are what I needed to play with.

On Thu, Oct 29, 2015 at 8:23 AM, Ken Hancock 
wrote:

> Or if you're doing a high volume of writes, then your flushed file size
> may be completely determined by other CFs that have consumed the commitlog
> size, forcing any memtables whose commitlog is being delete to be forced to
> disk.
>
>
> On Wed, Oct 28, 2015 at 2:51 PM, Jeff Jirsa 
> wrote:
>
>> It’s worth mentioning that initial flushed file size is typically
>> determined by memtable_cleanup_threshold and the memtable space options
>> (memtable_heap_space_in_mb, memtable_offheap_space_in_mb, depending on
>> memtable_allocation_type)
>>
>>
>>
>> From: Nate McCall
>> Reply-To: "user@cassandra.apache.org"
>> Date: Wednesday, October 28, 2015 at 11:45 AM
>> To: Cassandra Users
>> Subject: Re: memtable flush size with LCS
>>
>>
>>  do you mean that this property is ignored at memtable flush time, and so
>>> memtables are already allowed to be much larger than sstable_size_in_mb?
>>>
>>
>> Yes, 'sstable_size_in_mb' plays no part in the flush process. Flushing
>> is based on solely on runtime activity and the file size is determined by
>> whatever was in the memtable at that time.
>>
>>
>>
>> --
>> -
>> Nate McCall
>> Austin, TX
>> @zznate
>>
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: memtable flush size with LCS

2015-10-27 Thread Dan Kinder
Thanks, I am using most of the suggested parameters to tune compactions. To
clarify, when you say "The sstable_size_in_mb can be thought of a target
for the compaction process moving the file beyond L0." do you mean that
this property is ignored at memtable flush time, and so memtables are
already allowed to be much larger than sstable_size_in_mb?

On Tue, Oct 27, 2015 at 2:57 PM, Nate McCall  wrote:

> The sstable_size_in_mb can be thought of as a target for the compaction
> process moving the file beyond L0.
>
> Note: If there are more than 32 SSTables in L0, it will switch over to
> doing STCS for L0 (you can disable this behavior by passing
> -Dcassandra.disable_stcs_in_l0=true as a system property).
>
> With a lot of overwrites, the settings you want to tune will be
> gc_grace_seconds in combination with tombstone_threhsold,
> tombstone_compaction_interval and maybe unchecked_tombstone_compaction
> (there are different opinions about this last one, YMMV). Making these more
> aggressive and increasing your sstable_size_in_mb will allow for
> potentially capturing more overwrites in a level which will lead to less
> fragmentation. However, making the size too large will keep compaction from
> triggering on further out levels which can then exacerbate problems
> particulary if you have long-lived TTLs.
>
> In general, it is very workload specific, but monitoring the histogram for
> the number of ssables used in a read (via
> org.apache.cassandra.metrics.ColumnFamily.$KEYSPACE.$TABLE.SSTablesPerReadHistogram.95percentile
> or shown manually in nodetool cfhistograms output) after any change will
> help you narrow in a good setting.
>
> See
> http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html?scroll=compactSubprop__compactionSubpropertiesLCS
> for more details.
>
> On Tue, Oct 27, 2015 at 3:42 PM, Dan Kinder  wrote:
> >
> > Hi all,
> >
> > The docs indicate that memtables are triggered to flush when data in the
> commitlog is expiring or based on memtable_flush_period_in_ms.
> >
> > But LCS has a specified sstable size; when using LCS are memtables
> flushed when they hit the desired sstable size (default 160MB) or could L0
> sstables be much larger than that?
> >
> > Wondering because I have an overwrite workload where larger memtables
> would be helpful, and if I need to increase my LCS sstable size in order to
> allow for that.
> >
> > -dan
>
>
>
>
> --
> -
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


memtable flush size with LCS

2015-10-27 Thread Dan Kinder
Hi all,

The docs indicate that memtables are triggered to flush when data in the
commitlog is expiring or based on memtable_flush_period_in_ms.

But LCS has a specified sstable size; when using LCS are memtables flushed
when they hit the desired sstable size (default 160MB) or could L0 sstables
be much larger than that?

Wondering because I have an overwrite workload where larger memtables would
be helpful, and if I need to increase my LCS sstable size in order to allow
for that.
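
(If it does come to that, I assume the change would just be the compaction
subproperty, something like the following, with a placeholder table name and
an example size:)

ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 320};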

-dan


future very wide row support

2015-08-31 Thread Dan Kinder
Hi,

My understanding is that wide row support (i.e. many columns/CQL-rows/cells
per partition key) has gotten much better in the past few years; even
though the theoretical limit of 2 billion has long been much higher than
what is practical, it seems like now Cassandra is able to handle these
better (ex. incremental compactions so Cassandra doesn't OOM).

So I'm wondering:

   - With more recent improvements (say, including up to 2.2 or maybe 3.0),
   is the practical limit still much lower than 2 billion? Do we have any idea
   what limits us in this regard? (Maybe repair is still another bottleneck?)
   - Is the 2 billion limit an SSTable limitation?
   https://issues.apache.org/jira/browse/CASSANDRA-7447 seems to indicate
   that it might be. Is there any future work we think will increase this
   limit?

A couple of caveats:

I am aware that even if such a large partition is possible it may not
usually be practical because it works against Cassandra's primary feature
of sharding data to multiple nodes and parallelize access. However some
analytics/batch processing use-cases could benefit from the guarantee that
a certain set of data is together on a node. It can also make certain data
modeling situations a bit easier, where currently we just need to model
around the limitation. Also, 2 billion rows for small columns only adds up
to data in the tens of gigabytes, and use of larger nodes these days means
that practically one node could hold much larger partitions. And lastly,
there are just cases where 99.999% of partition keys are going to be
pretty small, but there are potential outliers that could be very large; it
would be great for Cassandra to handle these even if it is suboptimal,
helping us all avoid having to model around such exceptions.

Well, this turned into something of an essay... thanks for reading and glad
to receive input on this.


Re: Overwhelming tombstones with LCS

2015-07-10 Thread Dan Kinder
On Sun, Jul 5, 2015 at 1:40 PM, Roman Tkachenko  wrote:

> Hey guys,
>
> I have a table with RF=3 and LCS. Data model makes use of "wide rows". A
> certain query run against this table times out and tracing reveals the
> following error on two out of three nodes:
>
> *Scanned over 100000 tombstones; query aborted (see
> tombstone_failure_threshold)*
>
> This basically means every request with CL higher than "one" fails.
>
> I have two questions:
>
> * How could it happen that only two out of three nodes have overwhelming
> tombstones? For the third node tracing shows sensible *"Read 815 live and
> 837 tombstoned cells"* traces.
>

One theory: before 2.1.6 compactions on wide rows with lots of tombstones
could take forever or potentially never finish. What version of Cassandra
are you on? It may be that you got lucky with one node that has been able
to keep up but the others haven't been able to.


>
> * Anything I can do to fix those two nodes? I have already set gc_grace to
> 1 day and tried to make compaction strategy more aggressive
> (unchecked_tombstone_compaction - true, tombstone_threshold - 0.01) to no
> avail - a couple of days have already passed and it still gives the same
> error.
>

You probably want major compaction which is coming soon for LCS (
https://issues.apache.org/jira/browse/CASSANDRA-7272) but not here yet.

The alternative is, if you have enough time and headroom (this is going to
do some pretty serious compaction so be careful), alter your table to STCS,
let it compact into one SSTable, then convert back to LCS. It's pretty
heavy-handed but as long as your gc_grace is low enough it'll do the job.
Definitely do NOT do this if you have many tombstones in single wide rows
and are not on 2.1.6 or later.
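
Roughly, that switch looks like this (the table name is a placeholder; only do
it with plenty of disk headroom):

ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy'};
-- ...wait for compactions to settle down to a few large SSTables, then:
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};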


>
> Thanks!
>
> Roman
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Commitlog still replaying after drain && shutdown

2015-06-30 Thread Dan Kinder
Hi all,

To quote Sebastian Estevez in one recent thread: "You said you ran a
nodetool drain before the restart, but your logs show commitlogs replayed.
That does not add up..." The docs seem to generally agree with this: if you
did `nodetool drain` before restarting your node there shouldn't be any
commitlogs.

But my experience has been that if I do `nodetool drain`, I need to wait at
least 30-60 seconds after it has finished if I really want no commitlog
replay on restart. If I restart immediately (or even 10-20s later) then it
replays plenty. (This was true on 2.X and is still true on 2.1.7 for me.)

Is this unusual or the same thing others see? Is `nodetool drain` really
supposed to wait until all memtables are flushed and commitlogs are deleted
before it returns?

Thanks,
-dan


Re: counters still inconsistent after repair

2015-06-19 Thread Dan Kinder
Thanks Rob, this was helpful.

More counters will be added soon, I'll let you know if those have any
problems.

On Mon, Jun 15, 2015 at 4:32 PM, Robert Coli  wrote:

> On Mon, Jun 15, 2015 at 2:52 PM, Dan Kinder  wrote:
>
>> Potentially relevant facts:
>> - Recently upgraded to 2.1.6 from 2.0.14
>> - This table has ~million rows, low contention, and fairly high increment
>> rate
>>
> Can you repro on a counter that was created after the upgrade?
>
>> Mainly wondering:
>>
>> - Is this known or expected? I know Cassandra counters have had issues
>> but thought by now it should be able to keep a consistent counter or at
>> least repair it...
>>
> All counters which haven't been written to after 2.1 "new counters" are
> still on disk as "old counters" and will remain that way until UPDATEd and
> then compacted together with all old shards. "Old counters" can exhibit
> this behavior.
>
>> - Any way to "reset" this counter?
>>
> Per Aleksey (in IRC) you can turn a replica for an old counter into a new
> counter by UPDATEing it once.
>
> In order to do that without modifying the count, you can [1] :
>
> UPDATE tablename SET countercolumn = countercolumn +0 where id = 1;
>
> The important caveat that this must be done at least once per shard, with
> one shard per RF. The only way one can be sure that all shards have been
> UPDATEd is by contacting each replica node and doing the UPDATE + 0 there,
> because local writes are preferred.
>
> To summarize, the optimal process to upgrade your pre-existing counters to
> 2.1-era "new counters" :
>
> 1) get a list of all counter keys
> 2) get a list of replicas per counter key
> 3) connect to each replica for each counter key and issue an UPDATE + 0
> for that counter key
> 4) run a major compaction
>
> As an aside, Aleksey suggests that the above process is so heavyweight
> that it may not be worth it. If you just leave them be, all counters you
> actually use will become progressively more accurate over time.
>
> =Rob
> [1] Special thanks to Jeff Jirsa for verifying that this syntax works.
>



-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


counters still inconsistent after repair

2015-06-15 Thread Dan Kinder
Currently on 2.1.6 I'm seeing behavior like the following:

cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
  test |    30
(1 rows)
cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
  test |    90
(1 rows)
cqlsh:walker> select * from counter_table where field = 'test';
 field | value
-------+-------
  test |    30
(1 rows)

Using tracing I can see that one node has wrong data. However running
repair on this table does not seem to have done anything, I still see the
wrong value returned from this same node.

Potentially relevant facts:
- Recently upgraded to 2.1.6 from 2.0.14
- This table has ~million rows, low contention, and fairly high increment
rate

Mainly wondering:
- Is this known or expected? I know Cassandra counters have had issues but
thought by now it should be able to keep a consistent counter or at least
repair it...
- Any way to "reset" this counter?
- Any other stuff I can check?


Re: Multiple cassandra instances per physical node

2015-05-21 Thread Dan Kinder
@James Rothering yeah I was thinking of container in a broad sense: either
full virtual machines, docker containers, straight LXC, or whatever else
would allow the Cassandra nodes to have their own IPs and bind to default
ports.

@Jonathan Haddad thanks for the blog post. To ensure the same host does not
replicate its own data, would I basically need the nodes on a single host
to be labeled as one rack? (Assuming I use vnodes)
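
Concretely, something like this is what I'm picturing, with every instance
on a given physical host sharing a rack name (assuming
GossipingPropertyFileSnitch and NetworkTopologyStrategy; the dc and rack
names are made up):

# cassandra-rackdc.properties for both instances on physical host 1
dc=DC1
rack=host1

# cassandra-rackdc.properties for both instances on physical host 2
dc=DC1
rack=host2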

On Thu, May 21, 2015 at 1:02 PM, Sebastian Estevez <
sebastian.este...@datastax.com> wrote:

> JBOD --> just a bunch of disks, no raid.
>
> All the best,
>
>
> Sebastián Estévez
>
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
> DataStax is the fastest, most scalable distributed database technology,
> delivering Apache Cassandra to the world’s most innovative enterprises.
> Datastax is built to be agile, always-on, and predictably scalable to any
> size. With more than 500 customers in 45 countries, DataStax is the
> database technology and transactional backbone of choice for the worlds
> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
> On Thu, May 21, 2015 at 4:00 PM, James Rothering 
> wrote:
>
>> Hmmm ... Not familiar with JBOD. Is that just RAID-0?
>>
>> Also ... wrt  the container talk, is that a Docker container you're
>> talking about?
>>
>>
>>
>> On Thu, May 21, 2015 at 12:48 PM, Jonathan Haddad 
>> wrote:
>>
>>> If you run it in a container with dedicated IPs it'll work just fine.
>>> Just be sure you aren't using the same machine to replicate it's own data.
>>>
>>> On Thu, May 21, 2015 at 12:43 PM Manoj Khangaonkar <
>>> khangaon...@gmail.com> wrote:
>>>
>>>> +1.
>>>>
>>>> I agree we need to be able to run multiple server instances on one
>>>> physical machine. This is especially necessary in development and test
>>>> environments where one is experimenting and needs a cluster, but do not
>>>> have access to multiple physical machines.
>>>>
>>>> If you google , you  can find a few blogs that talk about how to do
>>>> this.
>>>>
>>>> But it is less than ideal. We need to be able to do it by changing
>>>> ports in cassandra.yaml. ( The way it is done easily with Hadoop or Apache
>>>> Kafka or Redis and many other distributed systems)
>>>>
>>>>
>>>> regards
>>>>
>>>>
>>>>
>>>> On Thu, May 21, 2015 at 10:32 AM, Dan Kinder 
>>>> wrote:
>>>>
>>>>> Hi, I'd just like some clarity and advice regarding running multiple
>>>>> cassandra instances on a single large machine (big JBOD array, plenty of
>>>>> CPU/RAM).
>>>>>
>>>>> First, I am aware this was not Cassandra's original design, and doing
>>>>> this seems to unreasonably go against the "commodity hardware" intentions
>>>>> of Cassandra's design. In general it seems to be recommended against (at
>>>>> least as far as I've heard from @Rob Coli and others).
>>>>>
>>>>> However maybe this term "commodity" is changing... my hardware/ops
>>>>> team argues that due to cooling, power, and other datacenter costs, having
>>>>> slightly larger nodes (>=32G RAM, >=24 CPU, >=8 disks JBOD) is actually a
>>>>> better price point. Now, I am not a hardware guy, so if this is not
>>>>> actually true I'd love to hear why, otherwise I pretty much need to take
>>>>> them at their word.
>>>>>
>>>>> Now, Cassandra features seemed to have improved such that JBOD works
>>>>> fairly well, but especially with memory/GC this seems to be reaching its
>>>>> limit. One Cassandra instance can only scale up so much.
>>>>>
>>>>> So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra
>>>>> nodes (each with 5 data disks, 1 commit log disk) and either give each its
>>>>> own container & IP or change the listen ports. Will this work? What are 
>>>>> the
>>>>> risks? Will/should Cassandra support this better in the future?
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> http://khangaonkar.blogspot.com/
>>>>
>>>
>>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Multiple cassandra instances per physical node

2015-05-21 Thread Dan Kinder
Hi, I'd just like some clarity and advice regarding running multiple
cassandra instances on a single large machine (big JBOD array, plenty of
CPU/RAM).

First, I am aware this was not Cassandra's original design, and doing this
seems to unreasonably go against the "commodity hardware" intentions of
Cassandra's design. In general it seems to be recommended against (at least
as far as I've heard from @Rob Coli and others).

However maybe this term "commodity" is changing... my hardware/ops team
argues that due to cooling, power, and other datacenter costs, having
slightly larger nodes (>=32G RAM, >=24 CPU, >=8 disks JBOD) is actually a
better price point. Now, I am not a hardware guy, so if this is not
actually true I'd love to hear why, otherwise I pretty much need to take
them at their word.

Now, Cassandra features seem to have improved such that JBOD works fairly
well, but especially with memory/GC this seems to be reaching its limit.
One Cassandra instance can only scale up so much.

So my question is: suppose I take a 12 disk JBOD and run 2 Cassandra nodes
(each with 5 data disks, 1 commit log disk) and either give each its own
container & IP or change the listen ports. Will this work? What are the
risks? Will/should Cassandra support this better in the future?


Delete query range limitation

2015-04-15 Thread Dan Kinder
I understand that range deletes are currently not supported (
http://stackoverflow.com/questions/19390335/cassandra-cql-delete-using-a-less-than-operator-on-a-secondary-key
)

Since Cassandra now does have range tombstones, is there a reason why this
can't be allowed? Is there a ticket for supporting it, or is it a
deliberate design decision not to?


Re: Finding nodes that own a given token/partition key

2015-03-26 Thread Dan Kinder
Thanks guys, think both of these answer my question. Guess I had overlooked
nodetool getendpoints. Hopefully findable by future googlers now.
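
For those future googlers, the invocation is simply (keyspace, table, and
key below are placeholders):

nodetool getendpoints myks mytable 'some-partition-key'

which prints the address of each replica that owns that key.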

On Thu, Mar 26, 2015 at 2:37 PM, Adam Holmberg 
wrote:

> Dan,
>
> Depending on your context, many of the DataStax drivers have the token
> ring exposed client-side.
>
> For example,
> Python:
> http://datastax.github.io/python-driver/api/cassandra/metadata.html#tokens-and-ring-topology
> Java:
> http://www.datastax.com/drivers/java/2.1/com/datastax/driver/core/Metadata.html
>
> You may not have to construct this yourself.
>
> Adam Holmberg
>
> On Thu, Mar 26, 2015 at 3:53 PM, Roman Tkachenko 
> wrote:
>
>> Hi Dan,
>>
>> Have you tried using "nodetool getendpoints"? It shows you nodes that
>> currently own the specific key.
>>
>> Roman
>>
>> On Thu, Mar 26, 2015 at 1:21 PM, Dan Kinder  wrote:
>>
>>> Hey all,
>>>
>>> In certain cases it would be useful for us to find out which node(s)
>>> have the data for a given token/partition key.
>>>
>>> The only solutions I'm aware of is to select from system.local and/or
>>> system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM
>>> myks.mytable WHERE thing = 'value';`, then do the math (put the ring
>>> together) and figure out which node has the data. I'm assuming this is what
>>> token aware drivers are doing.
>>>
>>> Is there a simpler way to do this?
>>>
>>> A bit more context: we'd like to move some processing closer to data,
>>> but for a few reasons hadoop/spark aren't good options for the moment.
>>>
>>
>>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Finding nodes that own a given token/partition key

2015-03-26 Thread Dan Kinder
Hey all,

In certain cases it would be useful for us to find out which node(s) have
the data for a given token/partition key.

The only solution I'm aware of is to select from system.local and/or
system.peers to grab the host_id and tokens, do `SELECT token(thing) FROM
myks.mytable WHERE thing = 'value';`, then do the math (put the ring
together) and figure out which node has the data. I'm assuming this is what
token-aware drivers are doing.

Is there a simpler way to do this?

A bit more context: we'd like to move some processing closer to data, but
for a few reasons hadoop/spark aren't good options for the moment.


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-03 Thread Dan Kinder
Per Aleksey Yeschenko's comment on that ticket, it does seem like a
timestamp granularity issue, but it should work properly if it is within
the same session. gocql by default uses 2 connections and 128 streams per
connection. If you set it to 1 connection with 1 stream this problem goes
away. I suppose that'll take care of it in testing.

At least one interesting conclusion here: a gocql.Session does not map to
one Cassandra "session". This makes some sense given that gocql says a
Session can be shared concurrently (so it had better not be just one
Cassandra session), but it is a bit concerning that there is no way to make
this 100% safe outside of cutting the gocql.Session down to 1 connection
and 1 stream.
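
For anyone else hitting this in tests, here is a minimal sketch of both
workarounds (the single-connection config, plus Peter's suggestion below of
supplying an explicit timestamp). NumConns and NumStreams are the knobs in
the gocql version we're on and may differ in newer releases; the table and
value are just illustrative:

package main

import (
    "time"

    "github.com/gocql/gocql"
)

func main() {
    cf := gocql.NewCluster("localhost")
    cf.NumConns = 1   // default is 2
    cf.NumStreams = 1 // default is 128
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Alternative workaround: supply the write timestamp ourselves in
    // microseconds, so two writes landing in the same millisecond can't tie.
    err = db.Query("UPDATE ut.test USING TIMESTAMP ? SET val = ? WHERE key = 'foo'",
        time.Now().UnixNano()/1000, "42").Exec()
    if err != nil {
        panic(err)
    }
}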

On Mon, Mar 2, 2015 at 5:34 PM, Peter Sanford 
wrote:

> The more I think about it, the more this feels like a column timestamp
> issue. If two inserts have the same timestamp then the values are compared
> lexically to decide which one to keep (which I think explains the
> "99"/"100" "999"/"1000" mystery).
>
> We can verify this by also selecting out the WRITETIME of the column:
>
> ...
> var prevTS int
> for i := 0; i < 10000; i++ {
>     val := fmt.Sprintf("%d", i)
>     db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()
>
>     var result string
>     var ts int
>     db.Query("SELECT val, WRITETIME(val) FROM ut.test WHERE key = 'foo'").Scan(&result, &ts)
>     if result != val {
>         fmt.Printf("Expected %v but got: %v; (prevTS:%d, ts:%d)\n", val, result, prevTS, ts)
>     }
>     prevTS = ts
> }
>
>
> When I run it with this change I see that the timestamps are in fact the
> same:
>
> Expected 10 but got: 9; (prevTS:1425345839903000, ts:1425345839903000)
> Expected 100 but got: 99; (prevTS:1425345839939000, ts:1425345839939000)
> Expected 101 but got: 99; (prevTS:1425345839939000, ts:1425345839939000)
> Expected 1000 but got: 999; (prevTS:1425345840296000, ts:1425345840296000)
>
>
> It looks like we're only getting millisecond precision instead of
> microsecond for the column timestamps?! If you explicitly set the timestamp
> value when you do the insert, you can get actual microsecond precision and
> the issue should go away.
>
> -psanford
>
> On Mon, Mar 2, 2015 at 4:21 PM, Dan Kinder  wrote:
>
>> Yeah I thought that was suspicious too, it's mysterious and fairly
>> consistent. (By the way I had error checking but removed it for email
>> brevity, but thanks for verifying :) )
>>
>> On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford 
>> wrote:
>>
>>> Hmm. I was able to reproduce the behavior with your go program on my dev
>>> machine (C* 2.0.12). I was hoping it was going to just be an unchecked
>>> error from the .Exec() or .Scan(), but that is not the case for me.
>>>
>>> The fact that the issue seems to happen on loop iteration 10, 100 and
>>> 1000 is pretty suspicious. I took a tcpdump to confirm that the gocql was
>>> in fact sending the "write 100" query and then on the next read Cassandra
>>> responded with "99".
>>>
>>> I'll be interested to see what the result of the jira ticket is.
>>>
>>> -psanford
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Senior Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Yeah I thought that was suspicious too, it's mysterious and fairly
consistent. (By the way I had error checking but removed it for email
brevity, but thanks for verifying :) )

On Mon, Mar 2, 2015 at 4:13 PM, Peter Sanford 
wrote:

> Hmm. I was able to reproduce the behavior with your go program on my dev
> machine (C* 2.0.12). I was hoping it was going to just be an unchecked
> error from the .Exec() or .Scan(), but that is not the case for me.
>
> The fact that the issue seems to happen on loop iteration 10, 100 and 1000
> is pretty suspicious. I took a tcpdump to confirm that the gocql was in
> fact sending the "write 100" query and then on the next read Cassandra
> responded with "99".
>
> I'll be interested to see what the result of the jira ticket is.
>
> -psanford
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Done: https://issues.apache.org/jira/browse/CASSANDRA-8892

On Mon, Mar 2, 2015 at 3:26 PM, Robert Coli  wrote:

> On Mon, Mar 2, 2015 at 11:44 AM, Dan Kinder  wrote:
>
>> I had been having the same problem as in those older post:
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E
>>
>
> As I said on that thread :
>
> "It sounds unreasonable/unexpected to me, if you have a trivial repro
> case, I would file a JIRA."
>
> =Rob
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
Nope, they flush every 5 to 10 minutes.

On Mon, Mar 2, 2015 at 1:13 PM, Daniel Chia  wrote:

> Do the tables look like they're being flushed every hour? It seems like
> the setting memtable_flush_after_mins which I believe defaults to 60
> could also affect how often your tables are flushed.
>
> Thanks,
> Daniel
>
> On Mon, Mar 2, 2015 at 11:49 AM, Dan Kinder  wrote:
>
>> I see, thanks for the input. Compression is not enabled at the moment,
>> but I may try increasing that number regardless.
>>
>> Also I don't think in-memory tables would work since the dataset is
>> actually quite large. The pattern is more like a given set of rows will
>> receive many overwriting updates and then not be touched for a while.
>>
>> On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli 
>> wrote:
>>
>>> On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder 
>>> wrote:
>>>
>>>> Theoretically sstable_size_in_mb could be causing it to flush (it's at
>>>> the default 160MB)... though we are flushing well before we hit 160MB. I
>>>> have not tried changing this but we don't necessarily want all the sstables
>>>> to be large anyway,
>>>>
>>>
>>> I've always wished that the log message told you *why* the SSTable was
>>> being flushed, which of the various bounds prompted the flush.
>>>
>>> In your case, the size on disk may be under 160MB because compression is
>>> enabled. I would start by increasing that size.
>>>
>>> Datastax DSE has in-memory tables for this use case.
>>>
>>> =Rob
>>>
>>>
>>
>>
>> --
>> Dan Kinder
>> Senior Software Engineer
>> Turnitin – www.turnitin.com
>> dkin...@turnitin.com
>>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: Less frequent flushing with LCS

2015-03-02 Thread Dan Kinder
I see, thanks for the input. Compression is not enabled at the moment, but
I may try increasing that number regardless.

Also I don't think in-memory tables would work since the dataset is
actually quite large. The pattern is more like a given set of rows will
receive many overwriting updates and then not be touched for a while.

On Fri, Feb 27, 2015 at 2:27 PM, Robert Coli  wrote:

> On Fri, Feb 27, 2015 at 2:01 PM, Dan Kinder  wrote:
>
>> Theoretically sstable_size_in_mb could be causing it to flush (it's at
>> the default 160MB)... though we are flushing well before we hit 160MB. I
>> have not tried changing this but we don't necessarily want all the sstables
>> to be large anyway,
>>
>
> I've always wished that the log message told you *why* the SSTable was
> being flushed, which of the various bounds prompted the flush.
>
> In your case, the size on disk may be under 160MB because compression is
> enabled. I would start by increasing that size.
>
> Datastax DSE has in-memory tables for this use case.
>
> =Rob
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Reboot: Read After Write Inconsistent Even On A One Node Cluster

2015-03-02 Thread Dan Kinder
Hey all,

I had been having the same problem as in those older post:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201411.mbox/%3CCAORswtz+W4Eg2CoYdnEcYYxp9dARWsotaCkyvS5M7+Uo6HT1=a...@mail.gmail.com%3E

To summarize it, on my local box with just one cassandra node I can update
and then select the updated row and get an incorrect response.

My understanding is this may have to do with not having fine-grained enough
timestamp resolution, but regardless I'm wondering: is this actually a bug
or is there any way to mitigate it? It causes sporadic failures in our unit
tests, and having to Sleep() between tests isn't ideal. At least confirming
it's a bug would be nice though.

For those interested, here's a little go program that can reproduce the
issue. When I run it I typically see:
Expected 100 but got: 99
Expected 1000 but got: 999

--- main.go: ---

package main

import (
    "fmt"

    "github.com/gocql/gocql"
)

func main() {
    cf := gocql.NewCluster("localhost")
    db, _ := cf.CreateSession()
    // Keyspace ut = "update test"
    err := db.Query(`CREATE KEYSPACE IF NOT EXISTS ut
        WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }`).Exec()
    if err != nil {
        panic(err.Error())
    }
    err = db.Query("CREATE TABLE IF NOT EXISTS ut.test (key text, val text, PRIMARY KEY(key))").Exec()
    if err != nil {
        panic(err.Error())
    }
    err = db.Query("TRUNCATE ut.test").Exec()
    if err != nil {
        panic(err.Error())
    }
    err = db.Query("INSERT INTO ut.test (key) VALUES ('foo')").Exec()
    if err != nil {
        panic(err.Error())
    }

    for i := 0; i < 10000; i++ {
        val := fmt.Sprintf("%d", i)
        db.Query("UPDATE ut.test SET val = ? WHERE key = 'foo'", val).Exec()

        var result string
        db.Query("SELECT val FROM ut.test WHERE key = 'foo'").Scan(&result)
        if result != val {
            fmt.Printf("Expected %v but got: %v\n", val, result)
        }
    }
}


Less frequent flushing with LCS

2015-02-27 Thread Dan Kinder
Hi all,

We have a table in Cassandra where we frequently overwrite recent inserts.
Compaction does a fine job with this but ultimately larger memtables would
reduce compactions.

The question is: can we make Cassandra use larger memtables and flush less
frequently? What currently triggers the flushes? OpsCenter shows them
flushing consistently at about 110MB in size; we have plenty of memory to
go larger.

According to
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_memtable_thruput_c.html
we can up the commit log space threshold, but this does not help, there is
plenty of runway there.

Theoretically sstable_size_in_mb could be causing it to flush (it's at the
default 160MB)... though we are flushing well before we hit 160MB. I have
not tried changing this, but we don't necessarily want all the sstables to
be large anyway.
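
For reference, these are the only other knobs I've found so far; we're
still on 2.0 here, but I'm listing the 2.1 names from the yaml as well in
case they matter (the values below are purely illustrative, not
recommendations):

# cassandra.yaml, 2.0.x:
memtable_total_space_in_mb: 4096

# cassandra.yaml, 2.1.x equivalents:
# memtable_heap_space_in_mb: 4096
# memtable_cleanup_threshold: 0.3  # higher = memtables grow larger before a flush triggers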

Thanks,
-dan


Re: large range read in Cassandra

2015-02-02 Thread Dan Kinder
For the benefit of others, I ended up finding out that the CQL library I
was using (https://github.com/gocql/gocql) at the time left the page size
defaulted to no paging, so Cassandra was trying to pull all rows of the
partition into memory at once. Setting the page size to a reasonable number
seems to have done the trick.
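
Roughly what that looks like, as a fragment (db is a *gocql.Session; the
page size value is just what worked for us, and the table/column names are
stand-ins for our actual schema):

iter := db.Query("SELECT link FROM walker.links WHERE domain = ?", dom).
    PageSize(5000). // without this, the whole partition comes back in one shot
    Iter()
var link string
for iter.Scan(&link) {
    // dispatch link for crawling...
}
if err := iter.Close(); err != nil {
    // handle timeouts / errors here
}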

On Tue, Nov 25, 2014 at 2:54 PM, Dan Kinder  wrote:

> Thanks, very helpful Rob, I'll watch for that.
>
> On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli 
> wrote:
>
>> On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder 
>> wrote:
>>
>>> To be clear, I expect this range query to take a long time and perform
>>> relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
>>> https://issues.apache.org/jira/browse/CASSANDRA-4415,
>>> http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
>>> so that we aren't literally pulling the entire thing in. Am I
>>> misunderstanding this use case? Could you clarify why exactly it would slow
>>> way down? It seems like with each read it should be doing a simple range
>>> read from one or two sstables.
>>>
>>
>> If you're paging through a single partition, that's likely to be fine.
>> When you said "range reads ... over rows" my impression was you were
>> talking about attempting to page through millions of partitions.
>>
>> With that confusion cleared up, the likely explanation for lack of
>> availability in your case is heap pressure/GC time. Look for GCs around
>> that time. Also, if you're using authentication, make sure that your
>> authentication keyspace has a replication factor greater than 1.
>>
>> =Rob
>>
>>
>>
>
>
> --
> Dan Kinder
> Senior Software Engineer
> Turnitin – www.turnitin.com
> dkin...@turnitin.com
>



-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: STCS limitation with JBOD?

2015-01-06 Thread Dan Kinder
Thanks for the info guys. Regardless of the reason for using nodetool
compact, it seems like the question still stands... but the impression I'm
getting is that nodetool compact on JBOD as I described will basically fall
apart. Is that correct?

To answer Colin's question as an aside: we have a dataset with fairly high
insert load and periodic range reads (batch processing). We have a
situation where we may want rewrite some rows (changing the primary key) by
deleting and inserting as a new row. This is not something we would do on a
regular basis, but after or during the process a compact would greatly help
to clear out tombstones/rewritten data.

@Ryan Svihla it also sounds like your suggestion in this case would be:
create a new column family, rewrite all data into that, truncate/remove the
previous one, and replace it with the new one.
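
In other words, something along these lines (a rough sketch only; the
schema, key derivation, and error handling are all made up for
illustration):

package main

import "github.com/gocql/gocql"

// newKey derives the new primary key from the old row; placeholder logic.
func newKey(oldKey string) string { return oldKey }

func main() {
    cf := gocql.NewCluster("host1")
    db, err := cf.CreateSession()
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // Stream everything out of the old table and write it into the new one
    // (which has the new primary key); drop the old table when done.
    iter := db.Query("SELECT old_key, payload FROM myks.old_cf").PageSize(5000).Iter()
    var oldKey, payload string
    for iter.Scan(&oldKey, &payload) {
        if err := db.Query("INSERT INTO myks.new_cf (new_key, payload) VALUES (?, ?)",
            newKey(oldKey), payload).Exec(); err != nil {
            panic(err)
        }
    }
    if err := iter.Close(); err != nil {
        panic(err)
    }
}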

On Tue, Jan 6, 2015 at 9:39 AM, Ryan Svihla  wrote:

> nodetool compact is the ultimate "running with scissors" solution, far
> more people manage to stab themselves in the eye. Customers running with
> scissors successfully not withstanding.
>
> My favorite discussions usually tend to result:
>
>1. "We still have tombstones" ( so they set gc_grace_seconds to 0)
>2. "We added a node after fixing it and now a bunch of records that
>were deleted have come back" (usually after setting gc_grace_seconds to 0
>and then not blanking nodes that have been offline)
>3. Why are my read latencies so spikey?  (cause they're on STC and now
>have a giant single huge SStable which worked fine when their data set was
>    tiny, now they're looking at 100 sstables on STC, which means slooow
>reads)
>4. "We still have tombstones" (yeah I know this again, but this is
>usually when they've switched to LCS, which basically noops with nodetool
>compact)
>
> All of this is managed when you have a team that understands the tradeoffs
> of nodetool compact, but I categorically reject it's a good experience for
> new users, as I've unfortunately had about dozen fire drills this year as a
> result of nodetool compact alone.
>
> Data modeling around partitions that are truncated when falling out of
> scope is typically far more manageable, works with any compaction strategy,
> and doesn't require operational awareness at the same scale.
>
> On Fri, Jan 2, 2015 at 2:15 PM, Robert Coli  wrote:
>
>> On Fri, Jan 2, 2015 at 11:28 AM, Colin  wrote:
>>
>>> Forcing a major compaction is usually a bad idea.  What is your reason
>>> for doing that?
>>>
>>
>> I'd say "often" and not "usually". Lots of people have schema where they
>> create way too much garbage, and major compaction can be a good response.
>> The docs' historic incoherent FUD notwithstanding.
>>
>> =Rob
>>
>>
>
>
>
> --
>
> Thanks,
> Ryan Svihla
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


STCS limitation with JBOD?

2015-01-02 Thread Dan Kinder
Hi,

Forcing a major compaction (using nodetool compact) with STCS will result
in a single sstable (ignoring repair data). However this seems like it
could be a problem for large JBOD setups. For example if I have 12 disks,
1T each, then it seems like on this node I cannot have one column family
store more than 1T worth of data (more or less), because all the data will
end up in a single sstable that can exist only on one disk. Is this
accurate? The compaction write path docs give a bit of hope that Cassandra
could split the one final sstable across the disks, but I doubt it is able
to and want to confirm.

I imagine that RAID/LVM, using LCS, or multiple cassandra instances not in
JBOD mode could be solutions to this (with their own problems), but want to
verify that this actually is a problem.

-dan


Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks, very helpful Rob, I'll watch for that.

On Tue, Nov 25, 2014 at 11:45 AM, Robert Coli  wrote:

> On Tue, Nov 25, 2014 at 10:45 AM, Dan Kinder  wrote:
>
>> To be clear, I expect this range query to take a long time and perform
>> relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
>> https://issues.apache.org/jira/browse/CASSANDRA-4415,
>> http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
>> so that we aren't literally pulling the entire thing in. Am I
>> misunderstanding this use case? Could you clarify why exactly it would slow
>> way down? It seems like with each read it should be doing a simple range
>> read from one or two sstables.
>>
>
> If you're paging through a single partition, that's likely to be fine.
> When you said "range reads ... over rows" my impression was you were
> talking about attempting to page through millions of partitions.
>
> With that confusion cleared up, the likely explanation for lack of
> availability in your case is heap pressure/GC time. Look for GCs around
> that time. Also, if you're using authentication, make sure that your
> authentication keyspace has a replication factor greater than 1.
>
> =Rob
>
>
>


-- 
Dan Kinder
Senior Software Engineer
Turnitin – www.turnitin.com
dkin...@turnitin.com


Re: large range read in Cassandra

2014-11-25 Thread Dan Kinder
Thanks Rob.

To be clear, I expect this range query to take a long time and perform
relatively heavy I/O. What I expected Cassandra to do was use auto-paging (
https://issues.apache.org/jira/browse/CASSANDRA-4415,
http://stackoverflow.com/questions/17664438/iterating-through-cassandra-wide-row-with-cql3)
so that we aren't literally pulling the entire thing in. Am I
misunderstanding this use case? Could you clarify why exactly it would slow
way down? It seems like with each read it should be doing a simple range
read from one or two sstables.

If this won't work then it may be that we need to start using
Hive/Spark/Pig etc. sooner, or page it manually using LIMIT and WHERE >
[the last returned result], along the lines of the sketch below.
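
A sketch of the manual-paging idea, with a made-up schema (a wide row keyed
by domain, clustered by link):

SELECT link FROM walker.links
  WHERE domain = 'example.com' AND link > 'http://example.com/last-seen-link'
  LIMIT 1000;

i.e. keep feeding the last link returned back into the WHERE clause until a
page comes back with fewer than LIMIT rows.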

On Mon, Nov 24, 2014 at 5:49 PM, Robert Coli  wrote:

> On Mon, Nov 24, 2014 at 4:26 PM, Dan Kinder  wrote:
>
>> We have a web crawler project currently based on Cassandra (
>> https://github.com/iParadigms/walker, written in Go and using the gocql
>> driver), with the following relevant usage pattern:
>>
>> - Big range reads over a CF to grab potentially millions of rows and
>> dispatch new links to crawl
>>
>
> If you really mean millions of storage rows, this is just about the worst
> case for Cassandra. The problem you're having is probably that you
> shouldn't try to do this in Cassandra.
>
> Your timeouts are either from the read actually taking longer than the
> timeout or from the reads provoking heap pressure and resulting GC.
>
> =Rob
>
>


large range read in Cassandra

2014-11-24 Thread Dan Kinder
Hi,

We have a web crawler project currently based on Cassandra (
https://github.com/iParadigms/walker, written in Go and using the gocql
driver), with the following relevant usage pattern:

- Big range reads over a CF to grab potentially millions of rows and
dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)

We ultimately planned on doing the batch processing step (the dispatching)
in a system like Spark, but for the time being it is also in Go. We believe
this should work fine given that Cassandra now properly allows chunked
iteration of columns in a CF.

The issue is, periodically while doing a particularly large range read,
other operations time out because that node is "busy". In an experimental
cluster with only two nodes (and replication factor of 2), I'll get an
error like: "Operation timed out - received only 1 responses." Indicating
that the second node took too long to reply. At the moment I have the long
range reads set to consistency level ANY but the rest of the operations are
on QUORUM, so on this cluster they require responses from both nodes. The
relevant CF is also using LeveledCompactionStrategy. This happens in both
Cassandra 2 and 2.1.

Despite this error I don't see any significant I/O, memory consumption, or
CPU usage.

Here are some of the configuration values I've played with:

Increasing timeouts:
read_request_timeout_in_ms: 15000
range_request_timeout_in_ms: 30000
write_request_timeout_in_ms: 10000
request_timeout_in_ms: 10000

Getting rid of caches we don't need:
key_cache_size_in_mb: 0
row_cache_size_in_mb: 0

Each of the 2 nodes has an HDD for the commit log and a single HDD for
data. Hence the following thread config (maybe since I/O is not an issue I
should increase these?):
concurrent_reads: 16
concurrent_writes: 32
concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O I've
increased this:
column_index_size_in_kb: 2048

It's something of a mystery why this error comes up. Of course with a 3rd
node it will get masked if I am doing QUORUM operations, but it still seems
like it should not happen, and that there is some kind of head-of-line
blocking or other issue in Cassandra. I would like to increase the amount
of dispatching I'm doing, but because of this issue it bogs down if I do.

Any suggestions for other things we can try here would be appreciated.

-dan