Re: Cassandra p95 latencies

2023-08-14 Thread Elliott Sims via user
1.  Check for Nagle/delayed-ACK interaction, though the driver probably sets
TCP_NODELAY, so it shouldn't be a problem.
2.  Check for network latency (just regular old ping among hosts, during
traffic)
3.  Check your GC metrics and see if garbage collections line up with
outliers.  Some tuning can help there, depending on the pattern, but 40ms
p99 at least would be fairly normal for G1GC.
4.  Check actual local write times, and I/O times with iostat.  If you have
spinning drives 40ms is fairly expected.  It's high but not totally
unexpected for consumer-grade SSDs.  For enterprise-grade SSDs commit times
that long would be very unusual.  What are your commitlog_sync settings?
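
A rough sketch of those checks from the shell, for reference (the config path,
the placeholder address/names, and the 5-second iostat interval are
assumptions; adjust for your environment):

# 2. network latency between hosts, while traffic is flowing
ping -c 100 10.0.0.12        # placeholder peer-node address

# 4. per-device I/O latency (look at the await columns), sampled every 5s
iostat -x 5

# commitlog sync settings currently in effect
grep -E '^commitlog_sync' /etc/cassandra/cassandra.yaml

# local read/write latency as this node itself measures it
nodetool tablehistograms my_keyspace my_table   # placeholder names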

On Mon, Aug 14, 2023 at 8:43 AM Josh McKenzie  wrote:

> The queries are rightly designed
>
> Data modeling in Cassandra is 100% gray space; there unfortunately is no
> right or wrong design. You'll need to share basic shapes / contours of your
> data model for other folks to help you; seemingly innocuous things in a
> data model can cause unexpected issues w/C*'s storage engine paradigm
> thanks to the partitioning and data storage happening under the hood.
>
> If you were seeing single digit ms on 3.0.X or 3.11.X and 40ms p95 on 4.0
> I'd immediately look to the DB as being the culprit. For all other cases,
> you should be seeing single digit ms as queries in C* generally boil down
> to key/value lookups (partition key) to a list of rows you either point
> query (key/value #2) or range scan via clustering keys and pull back out.
>
> There's also paging to take into consideration (whether you're using it or
> not, what your page size is) and the data itself (do you have thousands of
> columns? Multi-MB blobs you're pulling back out? etc). All can play into
> this.
>
> On Fri, Aug 11, 2023, at 3:40 PM, Jeff Jirsa wrote:
>
> You’re going to have to help us help you
>
> 4.0 is pretty widely deployed. I’m not aware of a perf regression
>
> Can you give us a schema (anonymized) and queries and show us a trace ?
>
>
> On Aug 10, 2023, at 10:18 PM, Shaurya Gupta 
> wrote:
>
>
> The queries are rightly designed, as I already explained. 40 ms is way too
> high compared to what I have seen with other DBs and, many times, with
> Cassandra 3.x versions.
> CPU consumption, as I mentioned, is not high; it is around 20%.
>
> On Thu, Aug 10, 2023 at 5:14 PM MyWorld  wrote:
>
> Hi,
> P95 should not be a problem if the schema is rightly designed. The levelled
> compaction strategy further reduces it, though it consumes some resources.
> For reads, caching is also helpful.
> Can you check your CPU iowait, as it could be the reason for the delay?
>
> Regards,
> Ashish
>
> On Fri, 11 Aug, 2023, 04:58 Shaurya Gupta,  wrote:
>
> Hi community
>
> What is the expected P95 latency for Cassandra read and write queries
> executed with LOCAL_QUORUM over a table with 3 replicas? The queries are
> done using the partition + clustering key, and the row size in bytes is not
> large, maybe 1-2 KB maximum.
> Assuming CPU is not a bottleneck?
>
> We observe 40 ms at P95 for reads and the same for writes. This looks very
> high compared to what we expected. We are using Cassandra 4.0.
>
> Any documentation / numbers will be helpful.
>
> Thanks
> --
> Shaurya Gupta
>
>
>
> --
> Shaurya Gupta
>
>
>



Re: TOO_MANY_KEY_UPDATES error with TLS

2023-04-12 Thread Elliott Sims via user
Update to this:  per https://github.com/openssl/openssl/issues/8068 it
looks like BoringSSL should avoid this issue, so it may be related to
client behavior of some sort.  It's unclear to me from the message whether
it's intra-cluster traffic or client/cluster traffic generating the error.
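
One rough way to narrow that down, assuming the default ports (9042 for
clients, 7000/7001 for internode) and the default log location, is to look at
which kind of connection is active when the warning fires and pull the
surrounding log context:

# client (native protocol) connections
ss -tn state established '( sport = :9042 )' | wc -l
# internode connections (7000, or 7001 if the legacy SSL storage port is in use)
ss -tn state established '( sport = :7000 or dport = :7000 or sport = :7001 or dport = :7001 )' | wc -l

# context around each occurrence of the warning
grep -B2 -A2 'TOO_MANY_KEY_UPDATES' /var/log/cassandra/system.log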

On Wed, Apr 12, 2023 at 11:36 AM Elliott Sims  wrote:

> A few weeks ago, we rolled out TLS among hosts in our clusters (running
> 4.0.7).  More recently we also rolled out TLS between Cassandra clients and
> the cluster.  Today, we started seeing a lot of dropped actions in one
> cluster that correlate with warnings like this:
>
> WARN  [epollEventLoopGroup-5-31] 2023-04-12 15:43:34,476
> PreV5Handlers.java:261 - Unknown exception in client networking
>
> io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException:
> error:1104:SSL routines:OPENSSL_internal:TOO_MANY_KEY_UPDATES
>
> at
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)
>
> at
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> at
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>
> at
> io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> at
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> at
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>
> at
> io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>
> at
> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>
> at
> io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>
> at
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>
> at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>
> at java.base/java.lang.Thread.run(Thread.java:829)
>
> Caused by: javax.net.ssl.SSLException: error:1104:SSL
> routines:OPENSSL_internal:TOO_MANY_KEY_UPDATES
>
> at
> io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1028)
>
> at
> io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1321)
>
> at
> io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1270)
>
> at
> io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1346)
>
> at
> io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1389)
>
> at
> io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:206)
>
> at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1387)
>
> at
> io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1294)
>
> at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1331)
>
> at
> io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
>
> at
> io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
>
> ... 15 common frames omitted
>
> INFO  [ScheduledTasks:1] 2023-04-12 15:46:19,701
> MessagingMetrics.java:206 - READ_RSP messages were dropped in last 5000 ms:
> 0 internal and 3 cross node. Mean internal dropped latency: 0 ms and Mean
> cross-node dropped latency: 5960 ms
>
> This looks similar to a bug in OpenSSL fixed in 2019:
> https://github.com/openssl/openssl/pull/8299
> but the equivalent change doesn't seem to have been ported over to
> BoringSSL.  Has anyone else run across this, or have some sort of
> workaround?
>
>



TOO_MANY_KEY_UPDATES error with TLS

2023-04-12 Thread Elliott Sims via user
A few weeks ago, we rolled out TLS among hosts in our clusters (running
4.0.7).  More recently we also rolled out TLS between Cassandra clients and
the cluster.  Today, we started seeing a lot of dropped actions in one
cluster that correlate with warnings like this:

WARN  [epollEventLoopGroup-5-31] 2023-04-12 15:43:34,476
PreV5Handlers.java:261 - Unknown exception in client networking

io.netty.handler.codec.DecoderException: javax.net.ssl.SSLException:
error:1104:SSL routines:OPENSSL_internal:TOO_MANY_KEY_UPDATES

at
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)

at
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)

at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)

at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)

at
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)

at
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)

at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)

at
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)

at
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)

at
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)

at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)

at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)

at
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)

at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)

at
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

at java.base/java.lang.Thread.run(Thread.java:829)

Caused by: javax.net.ssl.SSLException: error:1104:SSL
routines:OPENSSL_internal:TOO_MANY_KEY_UPDATES

at
io.netty.handler.ssl.ReferenceCountedOpenSslEngine.shutdownWithError(ReferenceCountedOpenSslEngine.java:1028)

at
io.netty.handler.ssl.ReferenceCountedOpenSslEngine.sslReadErrorResult(ReferenceCountedOpenSslEngine.java:1321)

at
io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1270)

at
io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1346)

at
io.netty.handler.ssl.ReferenceCountedOpenSslEngine.unwrap(ReferenceCountedOpenSslEngine.java:1389)

at
io.netty.handler.ssl.SslHandler$SslEngineType$1.unwrap(SslHandler.java:206)

at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1387)

at
io.netty.handler.ssl.SslHandler.decodeNonJdkCompatible(SslHandler.java:1294)

at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1331)

at
io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)

at
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)

... 15 common frames omitted

INFO  [ScheduledTasks:1] 2023-04-12 15:46:19,701 MessagingMetrics.java:206
- READ_RSP messages were dropped in last 5000 ms: 0 internal and 3 cross
node. Mean internal dropped latency: 0 ms and Mean cross-node dropped
latency: 5960 ms

This looks similar to a bug in OpenSSL fixed in 2019:
https://github.com/openssl/openssl/pull/8299
but the equivalent change doesn't seem to have been ported over to
BoringSSL.  Has anyone else run across this, or have some sort of
workaround?



Re: Cassandra on SLES 15?

2023-03-09 Thread Elliott Sims via user
A quick search shows SLES 15 provides Java 11 (java-11-openjdk), which is
just fine for Cassandra 4.x.
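
A minimal check on a SLES 15 box would be something like this (the package
name is the one mentioned above; the rest is an assumption about a stock
install):

sudo zypper install java-11-openjdk
java -version        # should report something like "openjdk version 11.0.x"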

On Wed, Mar 8, 2023 at 2:56 PM Eric Ferrenbach <
eric.ferrenb...@milliporesigma.com> wrote:

> We are running Cassandra 4.0.7.
>
> We are preparing to migrate our nodes from Centos to SUSE Linux.
>
>
>
> This page only mentions SLES 12 (not 15)
>
>
> https://cassandra.apache.org/doc/latest/cassandra/getting_started/installing.html
>
>
>
> This states SLES 12 Active support ends next year:
>
> https://endoflife.date/sles
>
>
>
> Does anyone have any information on running Cassandra 4 on SLES 15?
>
> Is this being tested anywhere?
>
>
>
> Thank you in advance,
>
> Eric
>
>
>
>



Re: Changing tokens between datacenters

2023-01-30 Thread Elliott Sims
For dealing with allocate_tokens_for_keyspace in datacenter migrations,
I've just created a dummy keyspace in the new DC with the desired topology,
then removed it once everything's done.
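
Roughly like this; the keyspace name, DC name, and replication factor below
are placeholders, not a prescription:

# cassandra.yaml on the new-DC nodes, before they bootstrap:
#   num_tokens: 16
#   allocate_tokens_for_keyspace: token_alloc_dummy

cqlsh -e "CREATE KEYSPACE token_alloc_dummy
          WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_new': 3};"

# ...bootstrap the new DC, alter real keyspaces, rebuild, switch traffic...

cqlsh -e "DROP KEYSPACE token_alloc_dummy;"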

On Mon, Jan 30, 2023 at 3:36 PM Doug Whitfield 
wrote:

> Hi folks,
>
> In our 3.11 deployments we are using the feature called virtual nodes
> (vnodes).
> So far, we have always used the old default value 256 for the num_tokens
> parameter specified in the cassandra.yaml (see also example file attached),
> as follows:
>
> num_tokens: 256
> # allocate_tokens_for_keyspace: KEYSPACE
> # initial_token:
>
> Due to problems with the repair in bigger topologies (duration and memory
> consumption), we now want to reduce the value of num_tokens to 32 and
> together with this to specify a keyspace used for one of our applications,
> e.g. as follows:
>
> num_tokens: 32
> allocate_tokens_for_keyspace: sb_keyspace
>
>
>
> The specified keyspace with parameter allocate_tokens_for_keyspace should
> feed its replication factor into the automatic allocation algorithm for an
> optimized replicated load over the nodes in the datacenter.
>
> At initial startup there seems to be a chicken-and-egg problem, as none of
> the keyspaces is existing in the finally desired setting.
> But this question here is not about initial startup, but rather about
> modifying an existing cluster with let’s say 2 datacenters currently
> running with the old default value (num_tokens: 256).
>
> To do this, we would temporarily remove one of the datacenters and re-add
> it with the reduced num_tokens and adapted allocate_tokens_for_keyspace.
> Followed by the same operation on the other datacenter.
>
> Main steps (for this case here now) of how we add a datacenter (same as
> described in publicly available information, e.g. by DataStax):
> (1) alter the keyspace definition of all keyspaces (where applicable,
> mainly the keyspaces of our applications) with a RF=0 in the new datacenter
> (2) start up all Cassandra nodes of the new datacenter, one by one
> (3) alter the keyspace definition of all keyspaces with the wanted RF in
> the new datacenter
> (4) perform on each node of the new datacenter: nodetool rebuild
> 
>
>
> But this leads to the following concerns some of our team members have:
> According to the recommended procedure how to add a datacenter, we would
> first define a RF of 0 for the keyspaces and then startup the nodes, which
> means the automatic allocation algorithm would in step (2) prepare the data
> distribution based on this (still) wrong RF, wouldn’t it?
>
> Or would the automatic allocation algorithm kick in at a later step? If
> so, when?
> Do you see anything wrong in the steps we are doing above?
> Do you have any other recommendation, how to perform this wanted change?
>
>
>
> Our testing does not show any errors, but it is a bit difficult to tell if
> things are balanced appropriately with a small amount of data. It could be
> costly to do the testing with a large amount of data. We still need to do
> the testing, but want to make sure we understand what we think should
> happen before we go down that route.
>
> My assumption is that this happens when the rebuild takes place, in the
> rebuild step. I took a look at
> https://github.com/apache/cassandra/blob/6da9e33602fad4b8bf9466dc0e9a73665469a195/src/java/org/apache/cassandra/tools/nodetool/Rebuild.java
> and I don’t see an obvious place, but then again, I am not a java developer.
>
> Lastly, I understand that this is much improved in 4.x. I also understand
> that 3.11 will be EOL shortly. Despite repeated attempts by myself to get
> an upgrade approved this isn’t happening at the moment.
>
> So, I guess there are two questions:
> 1. Is it correct that the rebuild does this, and if so, what is the piece
> in the code that does it?
> 2. Does anyone have experience doing this? Are there online instructions
> you used to complete the task? Obviously, we have some from DataStax as
> mentioned, but if there are others we might be able to compare and see
> where the two sets differ. This may give us some clues about our doubts.
>
>
> Best Regards,
>
>
>
> *Douglas Whitfield | Enterprise Architect, OpenLogic
> *
>
>
>
>
>
>
>



Re: Connection Latency with Cassandra 4.0.x

2023-01-11 Thread Elliott Sims
Consistently 200ms, during the back-and-forth negotiation rather than the
handshake?  That sounds suspiciously like Nagle interacting with Delayed
ACK.
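
A quick, hedged way to confirm that from the API server (the interface name
and example address are assumptions): capture the connection setup and look
for repeated ~200 ms gaps before small ACK-only packets.

# -ttt prints the delta between packets; ~0.200000s gaps before bare ACKs
# are the classic delayed-ACK-meets-Nagle signature
sudo tcpdump -i eth0 -ttt -nn 'port 9042 and host 10.0.0.5'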

On Wed, Jan 11, 2023 at 8:41 AM MyWorld  wrote:

> Hi all,
> We are facing a latency of 200 ms between the API server and the DB server
> when establishing a connection.
> We are working with Apache cassandra 4.0.7 and open jdk ver 11.0.17. We
> are using php on API side and connecting using php Cassandra driver (CPP
> ver 2.7) with below string.
> $cluster = Cassandra::cluster()
>  ->withContactPoints('x.x.x.x')
>  ->withPort(9042)
>  ->withCredentials("user", "pswd")
>  ->withPersistentSessions(true)
>  ->withDefaultConsistency(Cassandra::CONSISTENCY_LOCAL_QUORUM)
>   ->withDatacenterAwareRoundRobinLoadBalancingPolicy("dc1",0, false)
>   ->build();
>   $dbh = $cluster->connect('mykeyspace');
>
> We had earlier worked with version 3.11.6 and things were working fine.
> This is the first time we have installed 4.0.x and started facing this
> issue. No changes have been made on the driver side.
> Just note that both the API and DB servers are in the same location and the
> ping time to the DB server is less than 1 ms.
>
> Unable to identify the root cause of this. Does anyone have any clue?
>
> Regards,
> Ashish
>
>



Re: SSSD and Cassandra

2022-12-15 Thread Elliott Sims
If multiple things are dying under load, you'll want to check "dmesg" and
see if the oom-killer is getting triggered.  Something like "atop" can be
good for figuring out what was using all of the memory when it was
triggered if the kernel logs don't have enough info.
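
For example (log locations and tooling vary a bit by distro; these are the
usual first checks):

# kernel ring buffer; the oom-killer leaves lines like "Out of memory: Killed process ..."
dmesg -T | grep -iE 'out of memory|oom-killer'

# same thing from the journal, with timestamps, if dmesg has already rotated
journalctl -k | grep -i oom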

On Thu, Dec 15, 2022 at 12:41 AM Marc Hoppins  wrote:

> Update: It may be that the load on these hosts is causing problems for
> SSSD not the other way around.  In any case, it seems that both services
> are off at the same time.
>
>
>
> *From:* Marc Hoppins 
> *Sent:* Wednesday, December 14, 2022 10:59 AM
> *To:* user@cassandra.apache.org
> *Subject:* SSSD and Cassandra
>
>
>
>
> Hi all,
>
>
>
> If SSSD stops responding to requests/listening, is this going to cause the
> Cassandra service to shut down?  I didn’t see anything to indicate such
> behaviour in the config, only for disk issues.
>
>
>
> I had two hosts where SSSD was not accepting logins and, after restarting
> that service and login, I noticed that the Cassandra service was also
> stopped.
>
>
>
> Thanks
>
>
>
> M
>



Re: Question about num_tokens

2022-08-18 Thread Elliott Sims
I'm not sure I entirely agree with the docs there, as they don't quite
match my experiences, but it's going to depend a lot on your specific needs
and other parts of the configuration.

I think data distribution with low num_tokens is generally considered to be
less of a problem with larger clusters, but I'm not entirely sure that's
true in practice.  You get more of an even-looking distribution of data,
but still a bigger gap between the most and least utilized host and
therefore a need for a larger cluster.

The docs link to a PDF of a study for num_tokens values vs availability
with multiple-node failures.  The gist is that if multiple hosts fail you
may get lucky and those hosts won't contain overlapping token ranges
(therefore avoiding loss of availability).  A lower num_tokens increases
your odds of getting lucky there.  Running NetworkTopology also improves
your odds as well as makes it easier to determine whether a given set of
nodes going offline might affect availability.  It also gives you some
control in terms of reducing the odds of correlated failures on multiple
replicas from things like power or network outages.

I also think current Reaper's token range repairs and intelligence around
consolidating token ranges and safe concurrency limit the downside of
higher num_tokens values.  I've seen pretty good repair performance with
num_tokens 16 and no significant penalty adding in hosts with num_tokens 32
even, though 256 is still significantly slower.  If you're moving to 4.0+
and using incremental repairs, a lower num_tokens value may become
important again.

Anecdotally, running num_tokens 16 and even a mix of num_tokens 16 and 32
has been just fine with multiple clusters over 100 nodes.
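
For what it's worth, the quick sanity check I run after mixing token counts
is just per-node ownership (the keyspace name is a placeholder):

# "Owns (effective)" and "Load" show how even, or uneven, things actually are
nodetool status my_keyspace

# the Token line in nodetool info reports how many tokens this node was started with
nodetool info | grep -i token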

On Tue, Aug 16, 2022 at 12:15 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> Thanks for the response and details. I am just curious about the below
> statement mentioned in the doc. I am pretty confident that my clusters are
> going to grow to 100+ nodes (same DC or combining all DCs). I am just
> concerned that the doc says it is *not recommended for clusters over 50
> nodes*.
>
> 16
>
> Best for heavily elastic clusters which expand and shrink regularly, but
> may have issues availability with larger clusters. Not recommended for
> clusters over 50 nodes.
>
> On Sun, Mar 13, 2022 at 11:34 PM Elliott Sims 
> wrote:
>
>> More tokens:  better data distribution, more expensive repairs, higher
>> probability of a multi-host outage taking some data offline and affecting
>> availability.
>>
>> I think with >100 nodes the repair times and availability improvements
>> make a strong case for 16 tokens even though it means you'll need more
>> total raw space.
>>
>> Switching from 256 to 16 vnodes definitely will make data distribution
>> worse.  I'm not sure "hot spot" is the right description so much as a wider
>> curve.  I've got one cluster that hasn't been migrated from 256 to 16, and
>> it has about a 6% delta between the smallest and largest nodes instead of
>> more like 20% on the 16-vnode clusters.  The newer
>> allocate_tokens_for_keyspace and (better)
>> allocate_tokens_for_replication_factor options help limit the data
>> distribution issues, but don't totally eliminate them.
>>
>> On the other hand, the 16-vnode cluster takes less than half as long to
>> complete repairs via Reaper.  It also spends more time on GC, though I
>> can't tell whether that's due to vnodes or other differences.
>>
>> On Sun, Mar 13, 2022 at 5:59 PM Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>
>>> Hello Team,
>>>
>>> I am currently using num_tokens: 256 (default in 3.11.X version) for my
>>> clusters and trying to understand the advantages vs disadvantages of
>>> changing it to 16 (I believe 16 is the new recommended value). As per the Cassandra
>>> documentation
>>> <https://cassandra.apache.org/doc/latest/cassandra/getting_started/production.html#tokens>,
>>> 16 is not recommended for clusters over 50 nodes.
>>>
>>> Best for heavily elastic clusters which expand and shrink regularly, but
>>>> may have issues availability with larger clusters. Not recommended for
>>>> clusters over 50 nodes.
>>>
>>>
>>> I have a few questions.
>>>
>>>
>>>1. What are the general recommendations for a production cluster
>>>which is > 100 nodes and are heavily elastic in terms of adding and
>>>removing nodes.
>>>2. If I am switching from 256 -> 16 tokens, does this cause any
>>>hotspots by having the data concentrated to only a few

Re: Configuration for new(expanding) cluster and new admins.

2022-06-20 Thread Elliott Sims
If the token value is the same across heterogenous nodes, it means that
each node gets a (roughly) equivalent amount of data and work to do.  So
the bigger servers would be under-utilized.

My answer so far to varied hardware getting out of hand is a periodic
hardware refresh and "datacenter" migration.  Stand up a logical
"datacenter" with all-new uniform denser hardware and a uniform vnode count
(probably 16), migrate to it, tear down the old hardware.

On Thu, Jun 16, 2022 at 12:31 AM Marc Hoppins  wrote:

> Thanks for that info.
>
>
>
> I did see in the documentation that a value of 16 was not recommended for
> >50 hosts. Our existing hbase is 76 regionservers so I would imagine that
> (eventually) we will see a similar figure.
>
>
>
> There will be some scenarios where an initial setup may have (eg) 2 x 8
> HDD and future expansion adds either more HDD or newer nodes with larger
> storage.  It couldn’t be guaranteed that the storage would double but might
> increase by either less than 2x, or 3-4 x existing amount resulting in a
> heterogenous storage configuration.  In these cases how would it affect
> efficiency if the token figure were the same across all nodes?
>
>
>
> *From:* Elliott Sims 
> *Sent:* Thursday, June 16, 2022 12:24 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Configuration for new(expanding) cluster and new admins.
>
>
>
>
> If you set a different num_tokens value for new hosts (the value should
> never be changed on an existing host), the amount of data moved to that
> host will be proportional to the num_tokens value.  So, if the new hosts
> are set to 32 when they're added to the cluster, those hosts will get twice
> as much data as the initial 16-token hosts.
>
> I think it's generally advised to keep a Cassandra cluster identical in
> terms of hardware and num_tokens, at least within a DC.  I suspect having a
> lot of different values would slow down Reaper significantly, but I've had
> decent results so far adding a few hosts with beefier hardware and
> num_tokens=32 to an existing 16-token cluster.
>
>
>
> On Wed, Jun 15, 2022 at 1:33 AM Marc Hoppins 
> wrote:
>
> Hi all,
>
> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>
> 4-core, 2 x HDD (eg, 4TiB)
>
> num_tokens = 16 as a start point
>
> If a plan is to gradually increase the nodes per DC, and new hardware will
> have more of everything, especially storage, I assume I increase the
> num_tokens value.  Should I have started with a lower value?
>
> What would be considered as a good adjustment for:
>
> Any increase in number of HDD for any node?
>
> Any increase in capacity per HDD for any node?
>
> Is there any direct correlation between new token count and the
> proportional increase in either quantity of devices or total capacity, or
> is any adjustment purely arbitrary just to differentiate between varied
> nodes?
>
> Thanks
>
> M
>
>
>



Re: Configuration for new(expanding) cluster and new admins.

2022-06-15 Thread Elliott Sims
If you set a different num_tokens value for new hosts (the value should
never be changed on an existing host), the amount of data moved to that
host will be proportional to the num_tokens value.  So, if the new hosts
are set to 32 when they're added to the cluster, those hosts will get twice
as much data as the initial 16-token hosts.

I think it's generally advised to keep a Cassandra cluster identical in
terms of hardware and num_tokens, at least within a DC.  I suspect having a
lot of different values would slow down Reaper significantly, but I've had
decent results so far adding a few hosts with beefier hardware and
num_tokens=32 to an existing 16-token cluster.

On Wed, Jun 15, 2022 at 1:33 AM Marc Hoppins  wrote:

> Hi all,
>
> Say we have 2 datacentres with 12 nodes in each. All hardware is the same.
>
> 4-core, 2 x HDD (eg, 4TiB)
>
> num_tokens = 16 as a start point
>
> If a plan is to gradually increase the nodes per DC, and new hardware will
> have more of everything, especially storage, I assume I increase the
> num_tokens value.  Should I have started with a lower value?
>
> What would be considered as a good adjustment for:
>
> Any increase in number of HDD for any node?
>
> Any increase in capacity per HDD for any node?
>
> Is there any direct correlation between new token count and the
> proportional increase in either quantity of devices or total capacity, or
> is any adjustment purely arbitrary just to differentiate between varied
> nodes?
>
> Thanks
>
> M
>



Re: Topology vs RackDC

2022-06-07 Thread Elliott Sims
In terms of turning it into Ansible, it's going to depend a lot on how you
manage the physical layer as well as replication/consistency.  Currently, I
just use groups per "rack".  If you have an API-accessible CMDB you could
probably pull the physical location from there and translate that to
rack/DC info.  In our case the "rack" is used to control replica locations,
and that actually drives where the hosts will be physically located (trying
to avoid multiple racks/replicas on one switch or power rail)

On Fri, Jun 3, 2022 at 12:20 AM Marc Hoppins  wrote:

> There are cases supporting both sides. I can see the benefits of the more
> dynamic setup.
>
>
>
> However, how do you ansible/automate when you have multiple switches in 2
> or more datacentres and all your nodes are in the same VLAN or VLANs? This
> is the sticking point which I am trying to get to the bottom of.  I am not
> fully au fait with Ansible and we are also using Ans. Tower which allows
> for more flexibility so here should be some practical options.
>
>
>
> *From:* Durity, Sean R 
> *Sent:* Thursday, June 2, 2022 7:04 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Topology vs RackDC
>
>
>
>
> I agree; it does depend. Our ansible could not infer the DC name from the
> hostname or ip address of our on-prem hardware. That’s especially true when
> we are migrating to new hardware or OS and we are adding logical DCs with
> different names. I suppose it could be embedded in the ansible host file
> (but you are still maintaining that master file), but we don’t organize our
> hosts file that way. We are rarely adding a few nodes here or there, so the
> penalty of a rolling restart is minimal for us.
>
>
>
> Sean R. Durity
>
>
>
>
> *From:* Bowen Song 
> *Sent:* Thursday, June 2, 2022 12:25 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Topology vs RackDC
>
>
>
> It really depends on how do you manage your nodes. With automation tools,
> like Ansible, it's much easier to manage the rackdc file per node. The
> "master list" doesn't need to exist, because the file is written once and
> will never get updated. The automation tool will create nodes based on the
> required DC/rack, and writes that information to the rackdc file during the
> node provisioning process. It's much faster to add nodes to a large cluster
> with rackdc file  - no rolling restart required.
>
> On 02/06/2022 14:46, Durity, Sean R wrote:
>
> I agree with Marc. We use the cassandra-topology.properties file (and
> PropertyFileSnitch) for our deployments. Having a file different on every
> node has never made sense to me. There would still have to be some master
> file somewhere from which to generate that individual node file. There is
> the (slight) penalty that a change in topology requires the distribution of
> a new file and a rolling restart.
>
>
>
> Long live the PropertyFileSnitch! 
>
>
>
> Sean R. Durity
>
> *From:* Paulo Motta  
> *Sent:* Thursday, June 2, 2022 8:59 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Topology vs RackDC
>
>
>
> It think topology file is better for static clusters, while rackdc for
> dynamic clusters where users can add/remove hosts without needing to update
> the topology file on all hosts.
>
>
>
> On Thu, 2 Jun 2022 at 09:13 Marc Hoppins  wrote:
>
> Hi all,
>
> Why is RACKDC preferred for production than TOPOLOGY?
>
> Surely one common file is far simpler to distribute than deal with the
> mucky-muck of various configs for each host if they are in one rack or
> another and/or one datacentre or another?  It is also fairly
> self-documenting of the setup with the entire cluster there in one file.
>
> From what I read in the documentation, regardless of which snitch one
> implements, cassandra-topology.properties will get read, either as a
> primary or as a backup...so why not just use topology for ALL cases?
>
> Thanks
>
> Marc
>
>
>
>
>



Re: sstables changing in snapshots

2022-03-23 Thread Elliott Sims
I think this has a much simpler answer:  GNU tar interprets inode changes
as "changes" as well as block contents.  This includes the hardlink count.
I actually ended up working around it by using bsdtar, which doesn't
interpret hardlink count changes as a change to be concerned about.
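
Rough sketch of the workaround; the snapshot path and tag below are made-up
examples:

# GNU tar exits non-zero with "file changed as we read it" when the link
# count moves; bsdtar only compares file contents, so it stays quiet
SNAPDIR=/var/lib/cassandra/data/keyspacename/tablename-4eec3b01aba811e896342351775ccc66/snapshots
bsdtar -czf my_snapshot_tag.tar.gz -C "$SNAPDIR" my_snapshot_tag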

On Tue, Mar 22, 2022 at 6:56 PM James Brown  wrote:

> I filed https://issues.apache.org/jira/browse/CASSANDRA-17473 for this
> thread as a whole.
>
> Would you like a separate Jira issue on the matter of documenting how to
> tell when a snapshot is "ready"?
>
> James Brown
> Infrastructure Architect @ easypost.com
>
>
> On 2022-03-22 at 17:41:23, Dinesh Joshi  wrote:
>
>> Cassandra creates hardlinks[1] first and then writes the manifest[2]. But
>> that is not the last thing it writes either[3]. This should definitely be
>> documented. Could you please open a jira?
>>
>> [1]
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1956
>> [2]
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1977
>> [3]
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java#L1981
>>
>> On Mar 22, 2022, at 4:53 PM, James Brown  wrote:
>>
>>
>> There are not overlapping snapshots, so I don't think it's a second
>> snapshot. There are overlapping repairs.
>>
>>
>> > How does the backup process ensure the snapshot is taken before
>> starting to upload it ?
>>
>>
>> It just runs nice nodetool ${jmx_args[@]} snapshot -t "$TAG"
>> ${keyspaces[@]}
>>
>>
>> > A snapshot is only safe to use after the "manifest.json" file is
>> written.
>>
>>
>> Is this true? I don't see this anywhere in the documentation for
>> Cassandra (I would expect it on the Backups page, for example) or in the
>> help of nodetool snapshot. It was my understanding that when the nodetool
>> snapshot process finished, the snapshot was done. If that's wrong, it
>> definitely could be that we're just jumping the gun.
>>
>>
>> James Brown
>>
>> Infrastructure Architect @ easypost.com
>>
>>
>>
>> On 2022-03-22 at 10:38:56, Paul Chandler  wrote:
>>
>> > Hi Yifan,
>>
>> >
>>
>> > It looks like you are right, I can reproduce this, when creating the
>> second snapshot the ctime does get updated to the time of the second
>> snapshot.
>>
>> >
>>
>> > I guess this is what is causing tar to produce the error.
>>
>> >
>>
>> > Paul
>>
>> >
>>
>> >> On 22 Mar 2022, at 17:12, Yifan Cai  wrote:
>>
>> >>
>>
>> >> I am wondering if the cause is tarring when creating hardlinks, i.e.
>> creating a new snapshot.
>>
>> >>
>>
>> >> A quick experiment on my Mac indicates the file status (ctime) is
>> updated when creating hardlink.
>>
>> >>
>>
>> >> ➜ stat -f "Access (atime): %Sa%nModify (mtime): %Sm%nChange (ctime):
>> %Sc" a
>>
>> >> Access (atime): Mar 22 10:03:43 2022
>>
>> >> Modify (mtime): Mar 22 10:03:43 2022
>>
>> >> Change (ctime): Mar 22 10:05:43 2022
>>
>> >>
>>
>> >> On Tue, Mar 22, 2022 at 10:01 AM Jeff Jirsa  wrote:
>>
>> >> The most useful thing that folks can provide is an indication of what
>> was writing to those data files when you were doing backups.
>>
>> >>
>>
>> >> It's almost certainly one of:
>>
>> >> - Memtable flush
>>
>> >> - Compaction
>>
>> >> - Streaming from repair/move/bootstrap
>>
>> >>
>>
>> >> If you have logs that indicate compaction starting/finishing with
>> those sstables, or memtable flushing those sstables, or if the .log file is
>> included in your backup, pasting the contents of that .log file into a
>> ticket will make this much easier to debug.
>>
>> >>
>>
>> >>
>>
>> >>
>>
>> >> On Tue, Mar 22, 2022 at 9:49 AM Yifan Cai  wrote:
>>
>> >> I do not think there is a ticket already. Feel free to create one.
>> https://issues.apache.org/jira/projects/CASSANDRA/issues/
>>
>> >>
>>
>> >> It would be helpful to provide
>>
>> >> 1. The version of the cassandra
>>
>> >> 2. The options used for snapshotting
>>
>> >>
>>
>> >> - Yifan
>>
>> >>
>>
>> >> On Tue, Mar 22, 2022 at 9:41 AM Paul Chandler 
>> wrote:
>>
>> >> Hi all,
>>
>> >>
>>
>> >> Was there any further progress made on this? Did a Jira get created?
>>
>> >>
>>
>> >> I have been debugging our backup scripts and seem to have found the
>> same problem.
>>
>> >>
>>
>> >> As far as I can work out so far, it seems that this happens when a new
>> snapshot is created and the old snapshot is being tarred.
>>
>> >>
>>
>> >> I get a similar message:
>>
>> >>
>>
>> >> /bin/tar:
>> var/lib/cassandra/backup/keyspacename/tablename-4eec3b01aba811e896342351775ccc66/snapshots/csbackup_2022-03-22T14\\:04\\:05/nb-523601-big-Data.db:
>> file changed as we read it
>>
>> >>
>>
>> >> Thanks
>>
>> >>
>>
>> >> Paul
>>
>> >>
>>
>> >>
>>
>> >>
>>
>> >>> On 19 Mar 2022, at 02:41, Dinesh Joshi  wrote:
>>
>> >>>
>>
>> >>> Do you have a repro that you can share with us? If so, please file a
>> jira and we'll take a look.
>>
>> >>>
>>
>>  On Mar 18, 2022, at 12:15 PM, James Brown 
>> 

Re: Question about num_tokens

2022-03-14 Thread Elliott Sims
More tokens:  better data distribution, more expensive repairs, higher
probability of a multi-host outage taking some data offline and affecting
availability.

I think with >100 nodes the repair times and availability improvements make
a strong case for 16 tokens even though it means you'll need more total raw
space.

Switching from 256 to 16 vnodes definitely will make data distribution
worse.  I'm not sure "hot spot" is the right description so much as a wider
curve.  I've got one cluster that hasn't been migrated from 256 to 16, and
it has about a 6% delta between the smallest and largest nodes instead of
more like 20% on the 16-vnode clusters.  The newer
allocate_tokens_for_keyspace and (better)
allocate_tokens_for_replication_factor options help limit the data
distribution issues, but don't totally eliminate them.

On the other hand, the 16-vnode cluster takes less than half as long to
complete repairs via Reaper.  It also spends more time on GC, though I
can't tell whether that's due to vnodes or other differences.
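
For reference, a minimal sketch of what I'd expect on newly provisioned nodes
only (never change num_tokens on a node that already owns data); the path and
the RF of 3 are assumptions:

grep -E '^(num_tokens|allocate_tokens_for_)' /etc/cassandra/cassandra.yaml
# expected on a freshly provisioned node:
#   num_tokens: 16
#   allocate_tokens_for_keyspace: my_keyspace            # 3.11.x
#   allocate_tokens_for_replication_factor: 3            # 4.0+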

On Sun, Mar 13, 2022 at 5:59 PM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> Hello Team,
>
> I am currently using num_tokens: 256 (default in 3.11.X version) for my
> clusters and trying to understand the advantages vs disadvantages of
> changing it to 16 (I believe 16 is the new recommended value).  As per the 
> cassandra
> documentation
> 
>  16
> is not recommended for the cluster over 50 nodes.
>
> Best for heavily elastic clusters which expand and shrink regularly, but
>> may have issues availability with larger clusters. Not recommended for
>> clusters over 50 nodes.
>
>
> I have a few questions.
>
>
>1. What are the general recommendations for a production cluster which
>is > 100 nodes and are heavily elastic in terms of adding and removing
>nodes.
>2. If I am switching from 256 -> 16 tokens, does this cause any
>hotspots by having the data concentrated to only a few nodes and not
>distributing equally across all the nodes?
>
>



Re: gc throughput

2021-11-17 Thread Elliott Sims
CMS has a higher risk of a long stop-the-world full GC that will cause a
burst of timeouts, but if you're not getting that or don't mind if it
happens now and then CMS is probably the way to go.  It's generally
lower-overhead than G1.  If  you really don't care about latency it might
even be worth testing the Parallel collector, but at 16GB there might be
timeouts.

On Wed, Nov 17, 2021 at 6:25 AM onmstester onmstester 
wrote:

> Thank You
> I'm trying to achieve the highest possible (write) throughput with Cassandra
> and care less about latency. Recommendations from the community suggest it is
> better to use G1GC with a 16GB heap, but since I already get 92% throughput
> with CMS, should I consider changing it?
>
> Sent using Zoho Mail 
>
>
>  On Tue, 16 Nov 2021 16:52:29 +0330 *Bowen Song  >* wrote 
>
> Do you have any performance issues? such as long STW GC pauses or high
> p99.9 latency? If not, then you shouldn't tune the GC for the sake of it.
> However, if you do have performance issues related to GC, regardless what
> is the GC metric you are looking at saying, you will need to address the
> issue and that probably will involve some GC tunings.
> On 15/11/2021 06:00, onmstester onmstester wrote:
>
> Hi,
> We are using Apache Cassandra 3.11.2 with its default GC configuration
> (CMS and ...) on a 16GB heap. I inspected the GC logs using GCViewer and it
> reported 92% throughput. Does that mean no further GC tuning is necessary
> and everything is OK with Cassandra's GC?
>
>
> Sent using Zoho Mail 
>
>
>
>
>


Re: update cassandra.yaml file on number of cluster nodes

2021-10-18 Thread Elliott Sims
Ansible here as well with a similar setup.  A play at the end of the
playbook that waits until all nodes in the cluster are "UN" before moving
on to the next node to change.
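
The gate itself is tiny; roughly this, as a plain-shell sketch of what that
Ansible play checks (the 30-second poll interval is arbitrary):

# block until every node reported by this host is Up/Normal ("UN")
while nodetool status | awk '$1 ~ /^[UD][NLJM]$/' | grep -qv '^UN'; do
    echo "cluster not fully UN yet, waiting..."
    sleep 30
done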

On Mon, Oct 18, 2021 at 10:01 AM vytenis silgalis 
wrote:

> Yep, also use Ansible with configs living in git here.
>
> On Fri, Oct 15, 2021 at 5:19 PM Bowen Song  wrote:
>
>> We have Cassandra on bare-metal servers, and we manage our servers via
>> Ansible. In this use case, we create an Ansible playbook to update the
>> servers one by one, change the cassandra.yaml file, restart Cassandra, and
>> wait for it to finish the restart, and then wait for a few minutes before
>> moving on to the next server.
>> On 15/10/2021 22:42, ZAIDI, ASAD wrote:
>>
>>
>>
>> Hello Folks,
>>
>>
>>
>> Can you guys please suggest tool or approach  to update  cassandra.yaml
>> file in multi-dc environment with large number of nodes efficiently.
>>
>>
>>
>> Thank you.
>>
>> Asad
>>
>>
>>
>>
>>
>>


Re: Migrating Cassandra from 3.11.11 to 4.0.0 vs num_tokens

2021-09-05 Thread Elliott Sims
Won't option 2 in that list potentially cause some pretty severe load
imbalance in most cases?  The last node with 256 tokens will end up with
16x as much data on it as the 16 token nodes, right?

You'd have to mitigate it either by adding 16 new nodes for every one you
replace except the last one, or doing several rounds of replacing every
node with one that has somewhat fewer tokens.

On Sat, Sep 4, 2021, 2:36 AM Erick Ramirez 
wrote:

> It isn't possible to change the tokens on a node once it is already part
> of the cluster. Cassandra won't allow you to do it because it will make the
> data  already on disk unreadable. You'll need to either configure new nodes
> or add a new DC. I've answered an identical question in
> https://community.datastax.com/questions/12213/ where I've provided steps
> for the 2 options. I hope to draft a runbook and get it published on the
> Apache website in the coming days. Cheers!
>


Re: New Servers - Cassandra 4

2021-08-12 Thread Elliott Sims
Depends on your availability requirements, but in general I'd say if you're
going with N replicas, you'd want N failure domains (where one blade
chassis is a failure domain).

On Tue, Aug 10, 2021 at 11:16 PM Erick Ramirez 
wrote:

> That's 430TB of eggs in the one 4U basket so consider that against your
> MTTR requirements. I fully understand the motivation for that kind of
> configuration but *personally*, I wouldn't want to be responsible for its
> day-to-day operation but maybe that's just me. 
>


Re: High memory usage during nodetool repair

2021-08-09 Thread Elliott Sims
Shouldn't cause GCs.

You can usually think of heap memory separately from the rest.  It's
already allocated as far as the OS is concerned, and it doesn't know
anything about GC going on inside of that allocation.  You can set
"-XX:+AlwaysPreTouch" to make sure it's physically allocated on startup.
JVM OOMs when there's not enough memory in the heap, and a system OOM
(invocation of oom-killer) happens when there's not enough memory outside
of the heap.  The kernel will generally pretty aggressively reclaim mmap'd
memory before resorting to oom-killer.

The main con of disabling data mmap is exactly that - reduced read perf and
increased I/O load.

I think to some extent you're solving a non-problem by disabling mmap to
reduce memory utilization.  Unused memory is wasted memory, so there's not
a lot of reasons to avoid using it as a file read cache.  Especially if
you're preallocating JVM memory and not running any other services on that
host.  You probably only want to disable data mmap if your data-to-RAM
ratio is so high that it's just thrashing and not doing anything useful.
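
For what it's worth, a rough way to confirm what a node is actually running
with (paths are assumptions; disk_access_mode is often absent from
cassandra.yaml, in which case it defaults to auto):

# heap pre-touch flag
grep AlwaysPreTouch /etc/cassandra/jvm.options

# mmap behaviour: auto / mmap / mmap_index_only / standard
grep -E '^#?\s*disk_access_mode' /etc/cassandra/cassandra.yaml

# the resolved mode is usually logged at startup as well
grep -i 'diskaccessmode' /var/log/cassandra/system.log | tail -1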

On Tue, Aug 3, 2021 at 10:18 AM Amandeep Srivastava <
amandeep.srivastava1...@gmail.com> wrote:

> Thanks. I guess some earlier thread got truncated.
>
> I already applied Erick's recommendations and that seem to have worked in
> reducing the ram consumption by around 50%.
>
> Regarding cheap memory and hardware, we are already running 96GB boxes and
> getting multiple larger ones might be a little difficult at this point.
> Hence I wanted to understand cons of disabling mmap use for data.
>
> Besides degraded read performance, wouldn't we be putting more pressure on
> heap memory, when disabling mmap, which might cause frequent GCs and OOM
> errors at some point? Since currently whatever was being served by mmap
> would be loaded over heap now and processed/stored further.
>
> Also, we've disabled swap on the hosts, as recommended, to optimize
> performance, so Cassandra won't be able to fall back on swap if memory
> starts to fill up.
>
> On Tue, 3 Aug, 2021, 6:33 pm Jim Shaw,  wrote:
>
>> I think Erick posted https://community.datastax.com/questions/6947/.
>> explained very clearly.
>>
>> We hit same issue only on a huge table when upgrade, and we changed back
>> after done.
>> My understanding,  Which option to chose,  shall depend on your user
>> case. If chasing high performance on a big table, then go default one, and
>> increase capacity in memory, nowadays hardware is cheaper.
>>
>> Thanks,
>> Jim
>>
>> On Mon, Aug 2, 2021 at 7:12 PM Amandeep Srivastava <
>> amandeep.srivastava1...@gmail.com> wrote:
>>
>>> Can anyone please help with the above questions? To summarise:
>>>
>>> 1) What is the impact of using mmap only for indices besides a
>>> degradation in read performance?
>>> 2) Why does the off heap consumed during Cassandra full repair remains
>>> occupied 12+ hours after the repair completion and is there a
>>> manual/configuration driven way to clear that earlier?
>>>
>>> Thanks,
>>> Aman
>>>
>>> On Thu, 29 Jul, 2021, 6:47 pm Amandeep Srivastava, <
>>> amandeep.srivastava1...@gmail.com> wrote:
>>>
 Hi Erick,

 Limiting mmap to index only seems to have resolved the issue. The max
 ram usage remained at 60% this time. Could you please point me to the
 limitations for setting this param? - For starters, I can see read
 performance getting reduced up to 30% (CASSANDRA-8464
 )

 Also if you could please shed light on extended questions in my earlier
 email.

 Thanks a lot.

 Regards,
 Aman

 On Thu, Jul 29, 2021 at 12:52 PM Amandeep Srivastava <
 amandeep.srivastava1...@gmail.com> wrote:

> Thanks, Bowen, don't think that's an issue - but yes I can try
> upgrading to 3.11.5 and limit the merkle tree size to bring down the 
> memory
> utilization.
>
> Thanks, Erick, let me try that.
>
> Can someone please share documentation relating to internal
> functioning of full repairs - if there exists one? Wanted to understand 
> the
> role of the heap and off-heap memory separately during the process.
>
> Also, for my case, once the nodes reach the 95% memory usage, it stays
> there for almost 10-12 hours after the repair is complete, before falling
> back to 65%. Any pointers on what might be consuming off-heap for so long
> and can something be done to clear it earlier?
>
> Thanks,
> Aman
>
>
>

 --
 Regards,
 Aman

>>>


Re: Storing user activity logs

2021-07-19 Thread Elliott Sims
Your partition key determines your partition size.  Reducing retention
sounds like it would help some in your case, but really you'd have to split
it up somehow.  If it fits your query pattern, you could potentially have a
compound key of userid+datetime, or some other time-based split.  You could
also just split each user's rows into subsets with some sort of indirect
mapping, though that can get messy pretty fast.
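
As one hedged example of the compound-key idea (keyspace/table names, the
month bucket, and the TTL below are placeholders; pick a bucket small enough
that a single userid+bucket partition stays well under the usual ~100 MB
guideline):

cqlsh <<'CQL'
-- month bucket added to the partition key so one user's rows can no longer
-- grow into a single unbounded partition
CREATE TABLE IF NOT EXISTS myks.user_act_log_v2 (
    userid   bigint,
    month    int,        -- e.g. 202107, computed by the application
    datetime bigint,
    sno      uuid,
    PRIMARY KEY ((userid, month), datetime, sno)
) WITH default_time_to_live = 15552000    -- ~6 months, matching the current TTL
  AND CLUSTERING ORDER BY (datetime DESC, sno ASC);
CQL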

On Mon, Jul 19, 2021 at 9:01 AM MyWorld  wrote:

> Hi all,
>
> We are currently storing our user activity log in Cassandra with below
> architecture.
>
> Create table user_act_log(
> Userid bigint,
> Datetime bigint,
> Sno UUID,
> some more columns)
> With partition key - userid
> Clustering key - datetime, sno
> And TTL of 6 months
>
> Over time our table data has grown to around 500 GB, and we notice from the
> table histograms that our max partition size has also grown to a tremendous
> size (nearly 1 GB).
>
> So, please help me out what should be the right architecture for this use
> case?
>
> I am currently thinking of changing the compaction strategy from size-tiered
> to time-window with a 30-day window. But will this improve the partition size?
>
> Should we use any other db for such use case?
>
>
>
>


Re: Soon After Starting c* Process: CPU 100% for java Process

2021-07-01 Thread Elliott Sims
As more general advice, I'd strongly encourage you to update to 3.11.x from
2.2.8.  My personal experience is that it's significantly faster and more
space-efficient, and the garbage collection behavior under pressure is
drastically better.  There's also improved tooling for diagnosing
performance issues.

Also for narrowing down performance issues, I've had good luck with the
"ttop" module of Swiss Java Knife and with the async-profiler tool:
https://github.com/jvm-profiling-tools/async-profiler
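
Rough usage sketches for both; the jar/script locations and the 60-second
window are assumptions:

# per-thread CPU inside the Cassandra JVM, refreshed live
java -jar sjk.jar ttop -p $(pgrep -f CassandraDaemon) -o CPU -n 20

# 60-second CPU flame graph of the same process
./profiler.sh -d 60 -f /tmp/cassandra_cpu.html $(pgrep -f CassandraDaemon)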

On Thu, Jul 1, 2021 at 5:42 AM Fred Habash  wrote:

> Great. Thanks.
>
> I examined the debug logs, and from the time C* starts until it crashes the
> log keeps spitting out the exact same messages for the entire 30-minute
> duration.
>
> Google research leads to no clear answers as to ...
>
> - What does 'evicting cold readers' mean?
> - Why is it evicting the same sstable throughout, i.e. from start to crash?
> - What does it mean for the AbstractQueryPager to say 'remaining rows to
> page: 2147483646'?
> - How is all the above related?
>
> My Google research revealed a similar inquiry where the response alluded to
> a race condition and recommended upgrading to C* 3:
>
> Depending on the actual version, you may be running into a race condition
> or NPE that's not allowing the files to close properly. Try upgrading to
> the latest version of 3.x.
>
> In another hit it referenced a page callback.
>
> * page callback that does not have an executor assigned to it*
>
>
> This message repeats from start to crash
> ---
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,542
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-21] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:95 - Fetched 1 live rows
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
> considering pager exhausted
> DEBUG [SharedPool-Worker-8] 2021-06-30 12:34:51,543
> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
> DEBUG [SharedPool-Worker-2] 2021-06-30 12:34:51,543
> FileCacheService.java:102 - Evicting cold readers for
> /data/cassandra//Y-cf0c43b028e811e68f2b1b695a8d5b2c/
>
> Eventually, the shared pool worker crashes
> --
> b-3223-big-Data.db
> WARN  [SharedPool-Worker-55] 2021-06-30 19:55:41,677
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[SharedPool-Worker-55,5,main]: {}
> java.lang.IllegalArgumentException: Not enough bytes. Offset: 39289.
> Length: 20585. Buffer size: 57984
> at
> org.apache.cassandra.db.composites.AbstractCType.checkRemaining(AbstractCType.java:362)
> ~[apache-cassandra-2.2.8.jar:2.2.8]
>
> On Wed, Jun 30, 2021 at 7:53 PM Kane Wilson  wrote:
>
>> Looks like it's doing a lot of reads immediately on startup
>> (AbstractQueryPager) which is potentially causing a lot of GC (guessing
>> that's what caused the StatusLogger).
>>
>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>> AbstractQueryPager.java:133 - Remaining rows to page: 2147483646
>>
>> is quite suspicious. You'll want to find out what query is causing a
>> massive scan at startup, you probably need to have a look through the start
>> of the logs to get a better idea at what's happening at startup.
>>
>> On Thu, Jul 1, 2021 at 5:14 AM Fred Habash  wrote:
>>
>>> I have a node in the cluster where, when I start C*, the CPU reaches 100% with the
>>> java process on top. Within a few minutes, JVM instability messages appear
>>> in system.log and C* crashes.
>>>
>>> Once c* is up, cluster average read latency reaches multi-seconds and
>>> client apps are unhappy. For now, the only way out is to drain the node and
>>> let the cluster latency settle.
>>>
>>> None of these measures helped ...
>>> 1. Rebooting the ec2
>>> 2. Replacing the ec2 altogether (new ec2/ new c* install/ etc).
>>> 3. Stopping compactions (as a diagnostic measure)
>>>
>>> I'm trying to understand why the java process is chewing so much CPU, i.e. what
>>> is actually happening ...
>>>
>>> I see these error messages in the debug.log. What functional task do
>>> these messages relate to e.g. compactions?
>>>
>>>
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:95 - Fetched 1 live rows
>>> DEBUG [SharedPool-Worker-113] 2021-06-30 13:39:04,766
>>> AbstractQueryPager.java:112 - Got result (1) smaller than page size (5000),
>>> considering pager exhausted
>>> INFO  [Service Thread] 2021-06-30 13:39:04,766 StatusLogger.java:56 -
>>> MemtablePostFlush 0 0 29 0
>>> 0
>>>
>>> DEBUG 

Re: Huge single-node DCs (?)

2021-04-08 Thread Elliott Sims
I'm not sure I'd suggest building a single DIY Backblaze pod.  The SATA
port multipliers are a pain both from a supply chain and systems management
perspective.  Can be worth it when you're amortizing that across a lot of
servers and can exert some leverage over wholesale suppliers, but less so
for a one-off.  There's a lot more whitebox/OEM/etc options for
high-density storage servers these days from Seagate, Dell, HP, Supermicro,
etc that are worth a look.


I'd agree with this (both examples) sounding like a poor fit for
Cassandra.  Seems like you could always just spin up a bunch of Cassandra
VMs in the ESX cluster instead of one big one, but something like MySQL or
PostgreSQL might suit your needs better.  Or even some sort of flatfile
archive with something like Parquet if it's more being kept "just in case"
with no need for quick random access.

For the 10PB example, it may be time to look at something like Hadoop, or
maybe Ceph.

On Thu, Apr 8, 2021 at 10:39 AM Bowen Song  wrote:

> This is off-topic. But if your goal is to maximise storage density and
> also ensuring data durability and availability, this is what you should be
> looking at:
>
>- hardware:
>https://www.backblaze.com/blog/open-source-data-storage-server/
>- architecture and software:
>https://www.backblaze.com/blog/vault-cloud-storage-architecture/
>
>
> On 08/04/2021 17:50, Joe Obernberger wrote:
>
> I am also curious on this question.  Say your use case is to store
> 10PBytes of data in a new server room / data-center with new equipment,
> what makes the most sense?  If your database is primarily write with little
> read, I think you'd want to maximize disk space per rack space.  So you may
> opt for a 2u server with 24 3.5" disks at 16TBytes each for a node with
> 384TBytes of disk - so ~27 servers for 10PBytes.
>
> Cassandra doesn't seem to be a good choice for that configuration; the
> rule of thumb that I'm hearing is ~2 TBytes per node, in which case we'd
> need over 5000 servers.  This seems really unreasonable.
>
> -Joe
>
> On 4/8/2021 9:56 AM, Lapo Luchini wrote:
>
> Hi, one project I wrote is using Cassandra to back the huge amount of data
> it needs (data is written only once and read very rarely, but needs to be
> accessible for years, so the storage needs become huge in time and I chose
> Cassandra mainly for its horizontal scalability regarding disk size) and a
> client of mine needs to install that on his hosts.
>
> Problem is, while I usually use a cluster of 6 "smallish" nodes (which can
> grow in time), he only has big ESX servers with huge disk space (which is
> already RAID-6 redundant) but wouldn't have the possibility to have 3+
> nodes per DC.
>
> This is out of my usual experience with Cassandra and, as far as I read
> around, out of most use-cases found on the website or this mailing list, so
> the question is:
> does it make sense to use Cassandra with a big (let's talk 6TB today, up
> to 20TB in a few years) single-node DataCenter, and another single-node
> DataCenter (to act as disaster recovery)?
>
> Thanks in advance for any suggestion or comment!
>
>


Re: Cassandra video tutorials for administrators.

2021-03-17 Thread Elliott Sims
I'm a big fan of this one about LWTs:
https://www.youtube.com/watch?v=wcxQM3ZN20c
Not only if you want to understand LWTs, but also to get a better
understanding of the sometimes-unintuitive consistency promises made and
not made for non-LWT queries.

On Tue, Mar 16, 2021 at 11:53 PM  wrote:

> I know there is a lot of useful information out there, including on you
> tube. I am looking for recommendations for good introductory (but detailed)
> videos created by people who have cassandra cluster management, that
> outline all the day to day activities someone who is managing a cluster
> would understand and/or be doing.
>
> I believe I have a reasonably good grasp of Cassandra, but I am in that “I
> don’t know what I don’t know phase” where there might be things that I am
> unaware I should understand.
>
> Thanks,
> Justine
>


Re: What Happened To Alternate Storage And Rocksandra?

2021-03-12 Thread Elliott Sims
I'm not too familiar with the details on what's happened more recently, but
I do remember that while Rocksandra was very favorably compared to
Cassandra 2.x, the improvements looked fairly similar in nature and
magnitude to what Cassandra got from the move to the 3.x sstable format and
increased use of off-heap memory.  That might have damped a lot of the
enthusiasm for further development.

On Fri, Mar 12, 2021 at 10:50 AM Gareth Collins 
wrote:

> Hi,
>
> I remember a couple of years ago there was some noise about Rocksandra
> (Cassandra using rocksdb for storage) and opening up Cassandra to alternate
> storage mechanisms.
>
> I haven't seen anything about it for a while now though. The last commit
> to Rocksandra on github was in Nov 2019. The associated JIRA items
> (CASSANDRA-13474 and CASSANDRA-13476) haven't had any activity since 2019
> either.
>
> I was wondering whether anyone knew anything about it. Was it decided that
> this wasn't a good idea after all (the alleged performance differences
> weren't worth it...or were exaggerated)? Or is it just that it still may be
> a good idea, but there are no resources available to make this happen (e.g.
> perhaps the original sponsor moved onto other things)?
>
> I ask because I was looking at RocksDB/Kafka Streams for another project
> (which may replace some functionality which currently uses Cassandra)...and
> was wondering if there could be some important info about RocksDB I may be
> missing.
>
> thanks in advance,
> Gareth Collins
>


Re: Setting DC in different geographical location

2021-01-27 Thread Elliott Sims
To start, I'd try to figure out what your slowdown is.  Surely GCP has far,
far more than 17 MB/s available.
You don't want to cut it close on this, because for stuff like repairs,
rebuilds, interruptions, etc you'll want to be able to catch up and not
just keep up.
Generally speaking, Cassandra defers a lot of work and if you get behind
when you're already at the limit of performance it's going to deteriorate
badly.


Whether it's synchronous or async will depend on the consistency level used for the
write (ALL, LOCAL_QUORUM, etc.).

On Tue, Jan 26, 2021 at 6:37 PM MyWorld  wrote:

> Hi,
> We have a cluster with one Data Center of 3 nodes in GCP-US(RF=3).Current
> apache cassandra version 3.11.6. We are planning to add one new Data Center
> of 3 nodes in GCP-India.
>
> At peak hours, files generation in commit logs at GCP-US side on one node
> is around 1 GB per minute (i.e. 17+ MB/s).
>
> Currently the file transfer speed from GCP US to India is 9 mbps.
>
> So, with this speed, is it possible in cassandra to perform asynchronous
> write in new DC(India)?
> Also, is there any compression technique which cassandra applies while
> transferring data across DC?
>
> *My assumption*: All 3 coordinator nodes in the US will be responsible for
> transferring 1/3rd of the data to the new DC. So, at peak time only 1 GB/3 is what each
> node has to sync.
> Please let me know if my assumption is right. If yes, what will happen if
> the data generated in the commit log per node increases to 3 GB per minute tomorrow?
>
> Regards,
> Ashish
>
>


Re: strange behavior of counter tables after losing a node

2021-01-27 Thread Elliott Sims
To start with, maybe update to beta4.  There's an absolutely massive list of
fixes since alpha4.  I don't think the alphas are expected to be in a
usable/low-bug state necessarily, whereas beta4 is approaching RC status.

On Tue, Jan 26, 2021, 10:44 PM Attila Wind  wrote:

> Hey All,
>
> I'm coming back to my own question (see below), as this happened to us again
> 2 days later, so we took the time to further analyse this issue. I'd
> like to share our experiences and the workaround we figured out too.
>
> So to just quickly sum up the most important details again:
>
>- we have a 3 nodes cluster - Cassandra 4-alpha4 and RF=2 - in one DC
>- we are using ONE consistency level in all queries
>- if we lose one node from the cluster then
>   - non-counter table writes are fine, remaining 2 nodes taking over
>   everything
>   - but counter table writes start to fail with exception
>   "com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra
>   timeout during COUNTER write query at consistency ONE (1 replica were
>   required but only 0 acknowledged the write)"
>   - the two remaining nodes are both producing hints files for the
>   fallen one
>- just a note: counter_write_request_timeout_in_ms = 1,
>write_request_timeout_in_ms = 5000 in our cassandra.yaml
>
> To test this a bit further, we did the following:
>
>- we shut down one of the nodes normally
>In this case we do not have the above behavior - everything happens as
>it should, no failures on counter table writes
>so this is good
>- we reproduced the issue in our TEST env by hard-killing one of the
>nodes instead of normal shutdown (simulating a hardware failure as we had
>in PROD)
>Bingo, issue starts immediately!
>
> Based on the above observations, the "normal shutdown - no problem" case
> gave us an idea - so now we have a workaround for getting the cluster back into
> a working state in case we lose a node permanently (or for a
> long time at least):
>
>1. (in our case) we stop the App to stop all Cassandra operations
>2. stop all remaining nodes in the cluster normally
>3. restart them normally
>
> This way the remaining nodes realize the failed node is down and they are
> jumping into expected processing - everything works including counter table
> writes
>
> If anyone has any idea what to check / change / do in our cluster I'm all
> ears! :-)
>
> thanks
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
> 22.01.2021 07:35 keltezéssel, Attila Wind írta:
>
> Hey guys,
>
> Yesterday we had an outage after we have lost a node and we saw such a
> behavior we can not explain.
>
> Our data schema has both counter and normal tables. And we have
> replicationFactor = 2 and consistency level LOCAL_ONE (explicitly set).
>
> What we saw:
> After a node went down the updates of the counter tables slowed down. A
> lot! These updates normally take only a few millisecs but now started to
> take 30-60 seconds(!)
> At the same time the write ops against non-counter tables did not show any
> difference. The app log was silent in terms of errors. So the queries -
> including the counter table updates - were not failing at all (otherwise we would see
> exceptions in the DAO layer originating from the Cassandra driver).
> One more thing: only those updates where the lost node was involved (due to
> the partition key) suffered from the above huuuge wait time. Other updates
> just went fine.
>
> The whole thing looks like Cassandra internally started to wait - a lot -
> for the lost node. Updates finally succeeded without failure - at least for
> the App (the client).
>
> Did anyone ever experienced similar behavior?
> What could be an explanation for the above?
>
> Some more details: the App is implemented in Java 8, we are using Datastax
> driver 3.7.1 and server cluster is running on Cassandra 4.0 alpha 4.
> Cluster size is 3 nodes.
>
> Any feedback is appreciated! :-)
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>


Re: Cassandra on ZFS: disable compression?

2021-01-26 Thread Elliott Sims
The main downside I see is that you're hitting a less-tested codepath.  I
think very few installations have compression disabled today.
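
If you do experiment with it, the per-table switch is just an ALTER; a sketch
(keyspace and table names are placeholders):

    # turn off Cassandra-level compression for one table and rely on ZFS below it
    cqlsh -e "ALTER TABLE myks.mytable WITH compression = {'enabled': 'false'};"
    # existing sstables keep their old compression until they're rewritten, e.g.:
    nodetool upgradesstables -a myks mytable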

On Mon, Jan 25, 2021 at 7:06 AM Lapo Luchini  wrote:

> Hi,
>  I'm using a fairly standard install of Cassandra 3.11 on FreeBSD
> 12, by default filesystem is compressed using LZ4 and Cassandra tables
> are compressed using LZ4 as well.
>
> I was wondering if anybody had data about this already (or else, I will
> probably do some tests myself, eventually): would it be a nice idea to
> disable Cassandra compression and rely only on ZFS one?
>
> In principle I can see some pros:
> 1. it's done in kernel, might be slightly faster
> 2. can (probably) compress more data, as I see a 1.02 compression factor
> on filesystem even if I have compressed data in tables already
> 3. in upcoming ZFS version I will be able to use Zstd compression
> (probably before Cassandra 4.0 is gold)
> 4. (can inspect  compression directly at filesystem level)
>
> But on the other hand application-level compression could have its
> advantages.
>
> cheers,
>
> --
> Lapo Luchini
> l...@lapo.it
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Node configuration and capacity

2021-01-13 Thread Elliott Sims
1% packet loss can definitely lead to drops.  At higher speeds, that's
enough to limit TCP throughput to the point that cross-node communication
can't keep up.  TCP_BBR will do better than other strategies at maintaining
high throughput despite single-digit packet loss, but you'll also want to
track down the actual cause.

I'd be a bit hesitant to tune the transport threads any further until
you've solved the packet loss problem.
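
If you want to try BBR (it needs a 4.9+ kernel), the switch is just a couple of
sysctls; a sketch, to be persisted in /etc/sysctl.d/ if it helps:

    # BBR is usually paired with the fq qdisc
    sysctl -w net.core.default_qdisc=fq
    sysctl -w net.ipv4.tcp_congestion_control=bbr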

On Wed, Jan 13, 2021 at 8:53 AM MyWorld  wrote:

> Hi,
>
> We are currently using apache cassandra 3.11.6 in our production
> environment with single DC of 4 nodes.
>
> 2 nodes have configuration : Ssd 24 cores, 64gb ram, 20gb heap size
>
> Other 2 nodes have: Ssd 32cores, 64gb ram, 20gb heap size
>
> I have several questions around this.
>
> 1. Does different configuration nodes(cores) in single dc have any impact ?
>
> 2. Can we have different heap size in single DC on different nodes?
>
> 3. Which is better : single partition disk or multiple partition disk?
>
> 4. Currently we have 200 writes and around 5000 reads per sec per node (In
> 4 node cluster). How to determine max node capacity?
>
> 5. We are getting read/write operation timeout intermittently. There is no
> GC issue. However we have observed 1% packet loss between nodes. Can this
> be the cause of timeout issue?
>
> 6. Currently we are getting 1100 established connections from client side.
> Shall we increase native_transport_max_threads to 1000+? Currently we have
> increased it from default 128 to 512 after finding pending NTR requests
> during timeout issue.
>
> 7. Have found below h/w production recommendation from dse site. How much
> this is helpful for apache cassandra ?
>
> net.ipv4.tcp_keepalive_time=60
> net.ipv4.tcp_keepalive_probes=3
> net.ipv4.tcp_keepalive_intvl=10
> net.core.rmem_max=16777216
> net.core.wmem_max=16777216
> net.core.rmem_default=16777216
> net.core.wmem_default=16777216
> net.core.optmem_max=40960
> net.ipv4.tcp_rmem=4096 87380 16777216
> net.ipv4.tcp_wmem=4096 65536 16777216
>
>


Re: kill session in cassandra cluster

2021-01-06 Thread Elliott Sims
At least by default, Cassandra has pretty short timeouts.  I don't know of
a way to kill an in-flight query, but by the time you did it would have
timed out anyways.  I don't know of any way to stop it from repeating other
than tracking down the source and stopping it.

On Wed, Jan 6, 2021, 5:41 PM David Ni  wrote:

> Hello,Experts!
>  I want to know if there is a way to kill the session in cassandra
> cluster,for example,I get session_id from
> system_traces.sessions:4c9049a0-4fed-11eb-a60d-7f98ffdaf6cd,the session is
> running with very bad cql which causing bad performance,I need to kill it
> ASAP,could anyone help,thanks very much!
>
>
>
>


Re: Repairs on table with daily full load

2020-12-17 Thread Elliott Sims
Are you running with RF=3 and QUORUM on both read and write?
If so, I think as long as your fill job reports errors and retries you can
probably get away without repairing.
You can also hedge your bets by doing the data load with ALL, though of
course that has an availability tradeoff.

Personally, I'd probably look at running the initial load with ALL, falling
back on QUORUM and recording which data had to fall back.  That way you'll
know if there were inconsistencies and can correct them manually (full
repair or rebuild of a host that was down, or replaying the write with ALL
later), but without adding significant overhead to the process.
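
As a minimal illustration of the idea (table and values are placeholders; a
Spark job would set this on its writer configuration rather than in cqlsh):

    # run the load's statements at ALL so any failure tells you a replica missed the write
    cqlsh <<'EOF'
    CONSISTENCY ALL;
    INSERT INTO myks.daily_load (id, val) VALUES (1, 'x');
    EOF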

On Wed, Dec 16, 2020 at 12:43 AM Maxim Parkachov 
wrote:

> Hi everyone,
>
> There are a lot of articles, and this question has probably been asked
> many times already, but I am still not 100% sure.
>
> We have a table which we load almost in full every night with a spark job,
> using consistency LOCAL_QUORUM and a record TTL of 7 days. This is to remove
> records if they are not present in the last 7 imports. The table is located in 2
> DCs. We are interested only in the last record state. The definition of the
> table is below. After the load, we run a repair with Reaper on this
> table, which takes a lot of time and resources. We have multiple such tables
> and most of the repair time is busy with such tables. Running full load
> again takes less time than repair on this table.
>
> The question is: do we actually need to run repairs on this table at all?
> If yes, how often - daily, weekly?
>
> Thanks in advance,
> Maxim.
>
> WITH bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
> AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
>


Re: Vastly different disk I/O on different sized aws instances

2020-12-02 Thread Elliott Sims
Is the heap larger on the M5.4x instance?
Are you sure it's Cassandra generating the read traffic vs just evicting
files read by other systems?

In general, I'd call "more RAM means fewer drive reads" a very expected
result regardless of the details, especially when it's the difference
between fitting the whole data-set in RAM and not, so I'm not sure it's
worth doing that much digging.


On Wed, Dec 2, 2020 at 8:41 AM Carl Mueller
 wrote:

> Oh, this is cassandra 2.2.13 (multi tenant delays) and ubuntu 18.04.
>
> On Wed, Dec 2, 2020 at 10:35 AM Carl Mueller 
> wrote:
>
>> We have a cluster that is experiencing very high disk read I/O in the
>> 20-40 MB/sec range on m5.2x (gp2 drives). This is verified via VM metrics
>> as well as iotop.
>>
>> When we switch m5.4x it drops to 60 KB/sec.
>>
>> There is no difference in network send/recv, read/write request counts.
>>
>> The graph for read kb/sec mirrors the cpu.iowait.
>>
>> Compaction would have similar writes to go with reads as the sstables
>> were written. Flushing would be almost all writes. Swappiness is zero.
>>
>> I have done inotifywait to compare read volume on the data and log dirs.
>> They are roughly equivalent.
>>
>> File Caching could be a candidate, I used tobert's :
>> https://github.com/tobert/pcstat to see what files are in the file
>> cache, and that listed all files at 100%, I would think an overloaded file
>> cache would have different files swapping into the cache and partials on
>> the data files (data density for the node is about 30 GB).
>>
>> iotop indicates all the read traffic is from cassandra threads.
>>
>> Anyone have similar experiences?
>>
>


Re: Enable Ttracing

2020-11-16 Thread Elliott Sims
Tracing fully on rather than sampling will definitely add substantial load,
even with shorter TTLs.  That's a lot of extra writes.

If it's just on for specific sessions, or is enabled but with low sampling,
that's not bad in terms of load.

On Mon, Nov 16, 2020 at 6:25 AM Shalom Sagges 
wrote:

> Hi Guys,
>
> Our Service team would like to add a 3rd party tool (AppDynamics) that
> will monitor Cassandra.
> This tool will get read permissions on the system_traces keyspace and also
> needs to enable TRACING.
> tracetype_query_ttl in the yaml file will be reduced from 24 hours to 5
> minutes.
> I feel and fear that using TRACING ON constantly will add more pressure on
> the cluster.
> Am I correct with this feeling or is it safe to use tracing on a regular
> basis?
>
> Thanks!
>


Re: How to know if we need to increase heap size?

2020-08-20 Thread Elliott Sims
You want to look for full or long GCs in the logs, as well as how much
total time it's spending on GCing as a percentage.  Probably more the
latter, since you're not seeing long pauses with one core pegged and the
rest idle.  G1 handles oversized heaps well, so it's worth bumping to
20-27GB just to see what happens.

If it's not GC, then you're just running out of CPU and need more, or need
to figure out what queries are killing it.
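
A quick way to eyeball both (log paths are assumptions, and GCInspector only
logs pauses above a threshold):

    # pauses long enough for Cassandra itself to complain about:
    grep GCInspector /var/log/cassandra/system.log | tail -20
    # any full collections in the GC logs (file names vary by install):
    grep -i "full gc" /var/log/cassandra/gc.log*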

On Thu, Aug 20, 2020 at 10:45 AM Lee Tewksbury  wrote:

> Depending on your thread count, you can consider increasing the max native
> transport threads and concurrent reads. But the keys to Cassandra are pretty
> much: make good data, make good queries, and if you can't keep up, double
> the cluster size. If you're following the documentation on heap size (1/2
> RAM or 20GB, whichever is lower) then I would suggest increasing threads
> but more importantly increasing node count.
>
> On Thu, Aug 20, 2020 at 10:33 AM Krish Donald 
> wrote:
>
>> Hi,
>>
>> We have a cluster where, if reads suddenly increase 2-3x, the Cassandra CPU
>> goes to around 100% (we have 48-CPU machines with 128GB RAM) on a
>> few nodes and Cassandra becomes unresponsive.
>> We are on 3.11.5 and using G1GC with 16GB heap size.
>> When going through system.log and gc.log, I see that system.log is
>> just printing messages like the ones below every 5 secs (I have removed lines for
>> many keyspaces to reduce the size of the text), and a lot of messages are
>> getting printed in gc.log. I feel that maybe I need to increase the heap size
>> on these nodes, but I wanted to understand how we determine whether the heap
>> size should be increased or not. Nodes are not dying due to OOMs. When we
>> have OOMs, we know for sure we need to increase the heap size, but *what should
>> we look at in gc.log, system.log and debug.log to determine if we have to
>> increase the heap size?*
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,368
>> MessagingService.java:1246 - READ messages were dropped in last 5000 ms:
>> 199 internal and 232 cross node. Mean internal dropped latency: 10443 ms
>> and Mean cross-node dropped latency: 10402 ms
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,369 StatusLogger.java:47 -
>> Pool NameActive   Pending  Completed   Blocked  All
>> Time Blocked
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,377 StatusLogger.java:51 -
>> MutationStage 0 0   80051890 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,378 StatusLogger.java:51 -
>> ViewMutationStage 0 0  0 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,378 StatusLogger.java:51 -
>> ReadStage   192  1331  152624049 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,378 StatusLogger.java:51 -
>> RequestResponseStage  0 0  172822890 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,378 StatusLogger.java:51 -
>> ReadRepairStage   0 01545869 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,379 StatusLogger.java:51 -
>> CounterMutationStage  0 0  0 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,379 StatusLogger.java:51 -
>> MiscStage 0 0  0 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,379 StatusLogger.java:51 -
>> CompactionExecutor0 0 623536 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,379 StatusLogger.java:51 -
>> MemtableReclaimMemory 0 0   6700 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,380 StatusLogger.java:51 -
>> PendingRangeCalculator0 0 18 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,380 StatusLogger.java:51 -
>> GossipStage   0 01613366 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,380 StatusLogger.java:51 -
>> SecondaryIndexManagement  0 0  0 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,380 StatusLogger.java:51 -
>> HintsDispatcher   0 0  5 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,381 StatusLogger.java:51 -
>> MigrationStage0 0  1 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,381 StatusLogger.java:51 -
>> MemtablePostFlush 0 0  14830 0
>> 0
>>
>> INFO  [ScheduledTasks:1] 2020-08-19 08:13:12,381 StatusLogger.java:51 -
>> PerDiskMemtableFlushWriter_0 0 0   6700 0
>> 

Re: Difference in num_tokens between Cassandra 2 and 3?

2020-08-08 Thread Elliott Sims
I've found there to be some behavior differences in practice as well going
from 2.2 to 3.11 with a high token count, but all differences for the
better.  3.x seems noticeably less likely to crater or GC-thrash during
repairs compared to 2.x, probably due to the sum of small changes rather
than any one in particular.

On Thu, Aug 6, 2020 at 4:54 PM Leon Zaruvinsky 
wrote:

> Hi,
>
> I'm currently investigating an upgrade for our Cassandra cluster from 2.2
> to 3.11, and as part of that would like to understand if there is any
> change in how the cluster behaves w.r.t number of tokens.  For historical
> reasons, we have num_tokens set very high but want to make sure that this
> is not more dangerous in a later version.
>
> I've read recent threads on the new default, and the Netflix whitepaper,
> so I'm fairly comfortable with the pros/cons of various token counts - but
> specifically am interested about the difference in behavior between
> Cassandra major versions, if one exists.
>
> Thanks,
> Leon
>


Re: Generating evenly distributed tokens for vnodes

2020-05-27 Thread Elliott Sims
There's also a slightly older mailing list discussion on this subject that
goes into detail on this sort of strategy:
https://www.mail-archive.com/user@cassandra.apache.org/msg60006.html

I've been approximately following it, repeating steps 3-6 for the first
host in each "rack" (replica, since I have 3 racks and RF=3), then 8-10 for
the remaining hosts in the new datacenter.  So far, so good (sample size of
1), but it's a pretty painstaking process.

This should get a lot simpler with Cassandra 4+'s
"allocate_tokens_for_local_replication_factor" option, which will default
to 3.
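
As a sketch (paths and values are examples, and the setting only exists from
4.0 on), the per-node piece of that boils down to two cassandra.yaml lines set
before the new node bootstraps:

    cat >> /etc/cassandra/cassandra.yaml <<'EOF'
    num_tokens: 16
    allocate_tokens_for_local_replication_factor: 3
    EOF

On 3.x the closest equivalent remains allocate_tokens_for_keyspace, as
described in the linked posts.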

On Wed, May 27, 2020 at 4:34 AM Kornel Pal  wrote:

> Hi,
>
> Generating ideal tokens for single-token datacenters is well understood
> and documented, but there is much less information available on
> generating tokens with even ownership distribution when using vnodes.
> The best description I could find on token generation for vnodes is
>
> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>
> While allocate_tokens_for_keyspace results in much more even ownership
> distribution than random allocation, and does a great job at balancing
> ownership when adding new nodes, using it for creating a new datacenter
> results in less than ideal ownership distribution.
>
> After some experimentation, I found that it is possible to generate all
> the tokens for a new datacenter with an extended version of the Python
> script presented in the above blog post. Using these tokens seem to
> result in perfectly even ownership distribution with various
> token/node/rack configurations for all possible replication factors.
>
> Murmur3Partitioner:
>  >>> datacenter_offset = 0
>  >>> num_tokens = 4
>  >>> num_racks = 3
>  >>> num_nodes = 3
>  >>> print "\n".join(['[Rack #{}, Node #{}] initial_token: {}'.format(r
> + 1, n + 1, ','.join([str(((2**64 / (num_tokens * num_nodes *
> num_racks)) * (t * num_nodes * num_racks + n * num_racks + r)) - 2**63 +
> datacenter_offset) for t in range(num_tokens)])) for r in
> range(num_racks) for n in range(num_nodes)])
> [Rack #1, Node #1] initial_token:
> -9223372036854775808,-4611686018427387908,-8,4611686018427387892
> [Rack #1, Node #2] initial_token:
>
> -7686143364045646508,-3074457345618258608,1537228672809129292,6148914691236517192
> [Rack #1, Node #3] initial_token:
>
> -6148914691236517208,-1537228672809129308,3074457345618258592,7686143364045646492
> [Rack #2, Node #1] initial_token:
>
> -8710962479251732708,-4099276460824344808,512409557603043092,5124095576030430992
> [Rack #2, Node #2] initial_token:
>
> -7173733806442603408,-2562047788015215508,2049638230412172392,6661324248839560292
> [Rack #2, Node #3] initial_token:
>
> -5636505133633474108,-1024819115206086208,3586866903221301692,8198552921648689592
> [Rack #3, Node #1] initial_token:
>
> -8198552921648689608,-3586866903221301708,1024819115206086192,5636505133633474092
> [Rack #3, Node #2] initial_token:
>
> -6661324248839560308,-2049638230412172408,2562047788015215492,7173733806442603392
> [Rack #3, Node #3] initial_token:
>
> -5124095576030431008,-512409557603043108,4099276460824344792,8710962479251732692
>
> RandomPartitioner:
>  >>> datacenter_offset = 0
>  >>> num_tokens = 4
>  >>> num_racks = 3
>  >>> num_nodes = 3
>  >>> print "\n".join(['[Rack #{}, Node #{}] initial_token: {}'.format(r
> + 1, n + 1, ','.join([str(((2**127 / (num_tokens * num_nodes *
> num_racks)) * (t * num_nodes * num_racks + n * num_racks + r)) +
> datacenter_offset) for t in range(num_tokens)])) for r in
> range(num_racks) for n in range(num_nodes)])
> [Rack #1, Node #1] initial_token:
>
> 0,42535295865117307932921825928971026427,85070591730234615865843651857942052854,127605887595351923798765477786913079281
> [Rack #1, Node #2] initial_token:
>
> 14178431955039102644307275309657008809,56713727820156410577229101238628035236,99249023685273718510150927167599061663,141784319550391026443072753096570088090
> [Rack #1, Node #3] initial_token:
>
> 28356863910078205288614550619314017618,70892159775195513221536376548285044045,113427455640312821154458202477256070472,155962751505430129087380028406227096899
> [Rack #2, Node #1] initial_token:
>
> 4726143985013034214769091769885669603,47261439850130342147690917698856696030,89796735715247650080612743627827722457,132332031580364958013534569556798748884
> [Rack #2, Node #2] initial_token:
>
> 18904575940052136859076367079542678412,61439871805169444791998193008513704839,103975167670286752724920018937484731266,146510463535404060657841844866455757693
> [Rack #2, Node #3] initial_token:
>
> 33083007895091239503383642389199687221,75618303760208547436305468318170713648,118153599625325855369227294247141740075,160688895490443163302149120176112766502
> [Rack #3, Node #1] initial_token:
>
> 9452287970026068429538183539771339206,51987583835143376362460009468742365633,94522879700260684295381835397713392060,137058175565377992228303661326684418487
> [Rack #3, Node #2] initial_token:
>
> 

Re: Issues, understanding how CQL works

2020-04-21 Thread Elliott Sims
The short answer is that CQL isn't SQL.  It looks a bit like it, but the
structure of the data is totally different.  Essentially (ignoring
secondary indexes, which have some issues in practice and I think are
generally not recommended) the only way to look the data up is by the
partition key.  Anything else is a full-table scan and if you need more
querying flexibility Cassandra is probably not your best option.   With
only 260GB, I think I'd lean towards suggesting PostgreSQL or MySQL.
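
If you do stay on Cassandra, the usual workaround is to write the answer you
need into its own table, keyed by something you know at query time. A minimal
sketch (the extra table is an assumption, and the application has to keep it
updated on every insert):

    cqlsh -e "
      CREATE TABLE IF NOT EXISTS tagdata.latest_insert_by_signal (
        signalid   int PRIMARY KEY,
        insertdate bigint
      );"
    # 'most recent entry for a signal' then becomes a single-partition read:
    cqlsh -e "SELECT insertdate FROM tagdata.latest_insert_by_signal WHERE signalid = 4002;"

The same trick works for a single global 'latest' value by keying the table on
a constant.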

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter  wrote:

> Hi everyone,
>
> I'm very new to Cassandra. I have, however, some experience with SQL.
>
> I need to extract some information from a Cassandra database that has
> the following table definition:
>
> CREATE TABLE tagdata.central (
> signalid int,
> monthyear int,
> fromtime bigint,
> totime bigint,
> avg decimal,
> insertdate bigint,
> max decimal,
> min decimal,
> readings text,
> PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
> )
>
> The database is already of round about 260 GB in size.
> I now need to know what is the most recent entry in it; the correct
> column to learn this would be "insertdate".
>
> In SQL I would do something like this:
>
> SELECT insertdate FROM tagdata.central
> ORDER BY insertdate DESC LIMIT 1;
>
> In CQL, however, I just can't get it to work.
>
> What I have tried already is this:
>
> SELECT insertdate FROM "tagdata.central"
> ORDER BY insertdate DESC LIMIT 1;
>
> But this gives me an error:
> ERROR: ORDER BY is only supported when the partition key is restricted
> by an EQ or an IN.
>
> So, after some trial and error and a lot of Googling, I learned that I
> must include all rows from the PRIMARY KEY from left to right in my
> query. Thus, this is the "best" I can get to work:
>
>
> SELECT
> *
> FROM
> "tagdata.central"
> WHERE
> "signalid" = 4002
> AND "monthyear" = 201908
> ORDER BY
> "fromtime" DESC
> LIMIT 10;
>
>
> The "monthyear" column, I crafted like a fool by incrementing the date
> one month after another until no results could be found anymore.
> The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
> query results. But these can't be as easily guessed as the "monthyear"
> values could.
>
> This is where I'm stuck!
>
> 1. This does not really feel like the ideal way to go. I think there is
> something more mature in modern IT systems. Can anyone tell me what is a
> better way to get these informations?
>
> 2. I need a way to learn all values that are in the "monthyear" and
> "signalid" columns in order to be able to craft that query.
> How can I achieve that in a reasonable way? As I said: The DB is round
> about 260 GB which makes it next to impossible to just "have a look" at
> the output of "SELECT *"..
>
> Thanks for your help!
>
> Best regards,
> Marc Richter
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Multi DC replication between different Cassandra versions

2020-04-17 Thread Elliott Sims
If you're upgrading the whole cluster, I'd recommend going ahead and
upgrading all the way to 3.11.6 if possible.  In my experience it's been
noticeably faster, more reliable, and easier to manage compared to 3.0.x.

On Thu, Apr 16, 2020 at 6:37 PM Ashika Umagiliya 
wrote:

> Thank you for the clarifications,
>
> If this is not recommended, our last resort is to upgrade the entire
> cluster.
>
> About Kafka Connect, we found the following Source Connectors which can be
> used to ingest data from C* to Kafka:
>
> https://debezium.io/documentation/reference/connectors/cassandra.html
> https://docs.lenses.io/2.0/connectors/source/cassandra-cdc.html
> https://docs.lenses.io/2.0/connectors/source/cassandra.html
>
> https://www.datastax.com/press-release/datastax-announces-change-data-capture-cdc-connector-apache-kafka
>
>
>
>
> On Thu, Apr 16, 2020 at 9:42 PM Durity, Sean R <
> sean_r_dur...@homedepot.com> wrote:
>
>> I agree – do not aim for a mixed version as normal. Mixed versions are
>> fine during an upgrade process, but the goal is to complete the upgrade as
>> soon as possible.
>>
>>
>>
>> As for other parts of your plan, the Kafka Connector is a “sink-only,”
>> which means that it can only insert into Cassandra. It doesn’t go the other
>> way.
>>
>>
>>
>> I usually suggest that if the data is needed in two (or more) places,
>> that the application write to a queue. Then, let the queue feed all the
>> downstream destinations.
>>
>>
>>
>>
>>
>> Sean Durity – Staff Systems Engineer, Cassandra
>>
>>
>>
>> *From:* Christopher Bradford 
>> *Sent:* Thursday, April 16, 2020 1:13 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Multi DC replication between different
>> Cassandra versions
>>
>>
>>
>> It’s worth noting there can be issues with streaming between different
>> versions of C*. Note this excerpt from
>>
>> https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>> [thelastpickle.com]
>> 
>>
>>
>>
>>
>> Note that with an upgrade it’s important to keep in mind that *streaming
>> in a cluster running mixed versions of Casandra is not recommended*
>>
>>
>>
>> Emphasis mine. With the approach you’re suggesting streaming would be
>> involved both during bootstrap and repair. Would it be possible to upgrade
>> to a more recent release prior to pursuing this course of action?
>>
>>
>>
>> On Thu, Apr 16, 2020 at 1:02 AM Erick Ramirez 
>> wrote:
>>
>> I don't mean any disrespect but let me offer you a friendly advice --
>> don't do it to yourself. I think you would have a very hard time finding
>> someone who would recommend implementing a solution that involves mixed
>> versions. If you run into issues, it would be hell trying to unscramble
>> that egg.
>>
>>
>>
>> On top of that, Cassandra 3.0.9 is an ancient version released 4 years
>> ago (September 2016). There are several pages of fixes deployed since then.
>> So in the nicest possible way, what you're planning to do is not a good
>> idea. I personally wouldn't do it. Cheers!
>>
>> --
>>
>>
>> Christopher Bradford
>>
>>
>>
>> --
>>
>> The information in this Internet Email is confidential and may be legally
>> privileged. It is intended solely for the addressee. Access to this Email
>> by anyone else is unauthorized. If you are not the intended recipient, any
>> disclosure, copying, distribution or any action taken or omitted to be
>> taken in reliance on it, is prohibited and may be unlawful. When addressed
>> to our clients any opinions or advice contained in this Email are subject
>> to the terms and conditions expressed in any applicable governing The Home
>> Depot terms of business or client engagement letter. The Home Depot
>> disclaims all responsibility and liability for the accuracy and content of
>> this attachment and for any damages or losses arising from any
>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>> items of a destructive nature, which may be contained in this attachment
>> and shall not be liable for direct, indirect, consequential or special
>> damages in connection with this e-mail message or its attachment.
>>
>


Re: Hints replays very slow in one DC

2020-02-27 Thread Elliott Sims
I definitely saw a noticeable decrease in GC activity somewhere between
3.11.0 and 3.11.4.  I'm not sure which change did it, but I can't think of
any good reason to use 3.11.0 vs 3.11.6.

I would enable and look through GC logs (or just the slow-GC entries in the
default log) to see if the problem is that it's actually running out of
heap vs falling behind on GC.  For example, if it's doing long mixed or
full GCs and the old-gen space isn't shrinking much it's probably just too
much total data.  If it's just falling behind, there's some things like
InitiatingHeapOccupancyPercent you can tune.

It might also be worth looking at "ttop" from
https://github.com/aragozin/jvm-tools and sorting by heap allocation to see
if you can identify top offenders.

On Thu, Feb 27, 2020 at 9:59 AM Krish Donald  wrote:

> Thanks everyone for the response.
> How to debug more on GC issue ?
> Is there any GC issue which is present in 3.11.0 ?
>
> On Thu, Feb 27, 2020 at 8:46 AM Reid Pinchback 
> wrote:
>
>> Our experience with G1GC was that 31gb wasn’t optimal (for us) because
>> while you have less frequent full GCs they are bigger when they do happen.
>> But even so, not to the point of a 9.5s full collection.
>>
>>
>>
>> Unless it is a rare event associated with something weird happening
>> outside of the JVM (there are some whacky interactions between memory and
>> dirty page writing that could cause it, but not typically), then that is
>> evidence of a really tough fight to reclaim memory.  There are a lot of
>> things that can impact garbage collection performance.  Something is either
>> being pushed very hard, or something is being constrained very tightly
>> compared to resource demand.
>>
>>
>>
>> I’m with Erick, I wouldn’t be putting my attention right now on anything
>> but the GC issue. Everything else that happens within the JVM envelope is
>> going to be a misread on timing until you have stable garbage collection.
>> You might have other issues later, but you aren’t going to know what those
>> are yet.
>>
>>
>>
>> One thing you could at least try to eliminate quickly as a factor.  Are
>> repairs running at the time that things are slow?  In prior to 3.11.5 you
>> lack one of the tuning knobs for doing a tradeoff on memory vs network
>> bandwidth when doing repairs.
>>
>>
>>
>> I’d also make sure you have tuned C* to migrate whatever you reasonably
>> can to be off-heap.
>>
>>
>>
>> Another thought for surprise demands on memory.  I don’t know if this is
>> in 3.11.0, you’ll have to check the C* bash scripts for launching the
>> service.  The number of malloc arenas haven’t always been curtailed, and
>> that could result in an explosion in memory demand.  I just don’t recall
>> where in C* version history that was addressed.
>>
>>
>>
>>
>>
>> *From: *Erick Ramirez 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Wednesday, February 26, 2020 at 9:55 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Hints replays very slow in one DC
>>
>>
>>
>> *Message from External Sender*
>>
>> Nodes are going down due to Out of Memory, and we are using a 31GB heap size
>> in DC1; however, DC2 (which serves the traffic) has a 16GB heap.
>>
>> The reason we had to increase the heap in DC1 is that DC1 nodes were going down
>> due to the Out of Memory issue, but DC2 nodes never went down.
>>
>>
>>
>> It doesn't sound right that the primary DC is DC2 but DC1 is under load.
>> You might not be aware of it but the symptom suggests DC1 is getting hit
>> with lots of traffic. If you run netstat (or whatever utility/tool of
>> your choice), you should see established connections to the cluster. That
>> should give you clues as to where it's coming from.
>>
>>
>>
>> We also noticed below kind of messages in system.log
>>
>> FailureDetector.java:288 - Not marking nodes down due to local pause of
>> 9532654114 > 50
>>
>>
>>
>> That's another smoking gun that the nodes are buried in GC. A 9.5-second
>> pause is significant. The slow hinted handoffs is really the least of your
>> problem right now. If nodes weren't going down, there wouldn't be hints to
>> handoff in the first place. Cheers!
>>
>>
>>
>> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have
>> answers! Share your expertise on https://community.datastax.com/
>> 
>> .
>>
>


Re: Nodes becoming unresponsive

2020-02-06 Thread Elliott Sims
Async-profiler (https://github.com/jvm-profiling-tools/async-profiler )
flamegraphs can also be a really good tool to figure out the exact
callgraph that's leading to the futex_wait, both in and out of the JVM.


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-21 Thread Elliott Sims
In addition to extra space, queries can potentially be more expensive
because more dead rows and tombstones will need to be scanned.  How much of
a difference this makes will depend drastically on the schema and access
pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
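
For what it's worth, the CQL form of the change (keyspace/table names are
placeholders) is just:

    # 691200 seconds = 8 days; keep it comfortably longer than your repair cycle
    cqlsh -e "ALTER TABLE myks.mytable WITH gc_grace_seconds = 691200;"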

On Tue, Jan 21, 2020 at 2:14 PM Sergio  wrote:

> https://stackoverflow.com/a/22030790
>
>
> For CQLSH
>
> alter table  with GC_GRACE_SECONDS = ;
>
>
>
> Il giorno mar 21 gen 2020 alle ore 13:12 Sergio 
> ha scritto:
>
>> Hi guys!
>>
>> I just wanted to confirm with you before doing such an operation. I
>> expect the space usage to increase, but nothing more than this. I need to perform
>> just:
>>
>> UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
>>
>> Is it correct?
>>
>> Thanks,
>>
>> Sergio
>>
>


Re: Ec2 instance transient network issues caused 500 errors

2019-12-30 Thread Elliott Sims
On the systems side of things, I've found that using the new BBR TCP
congestion algorithm results in far better behavior in cases of low to
moderate packet loss compared to any of the older strategies.  It can't fix
totally broken, but it takes good advantage of "usable but lossy".  0.5-2%
loss would cripple the cluster, but with BBR it hardly notices.

On Mon, Dec 30, 2019, 8:28 AM Rahul Reddy  wrote:

> Hello,
>
> We have our Cassandra cluster running on AWS, with 2 DCs of 6 and 6
> nodes in both regions, RF=3 and Cassandra version 3.11.3. One of the
> EC2 instances had network issues for 9 minutes. Since it was a
> network issue, neither the EC2 instance nor Cassandra was down. But this caused high
> coordinator latencies (5 seconds) and impacted 5% of our traffic
> with timeouts. We have both disk_failure_policy and
> commit_failure_policy set to stop.  Please let me know if there is any workaround
> for this kind of issue?
>
> Sent from my iPhone
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Cassandra Recommended System Settings

2019-10-21 Thread Elliott Sims
The TCP settings are basically "how much RAM to use to buffer data for TCP
sessions, per session", which translates roughly to maximum TCP window
size.  You can actually calculate approximately what you need by just
multiplying bandwidth and latency (10,000,000,000 bps * 0.0001 s / 8 bits per
byte = 125KB of buffer needed to fill the pipe).  In practice, I'd double or triple
the max setting vs the calculated value.  The suggested value from Datastax
is 16MB, which doesn't seem like a lot, but if you have 1,000 connections
that could lead to up to 16GB of RAM being dedicated to TCP buffers.

As an example, my traffic in and out of Cassandra is within a local 10Gb
network.  I use "4096 87380 6291456", but that's not particularly
highly-tuned for Cassandra specifically (that is, it's a value also used by
hosts that talk to the outside internet with much higher latency).
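
Applied as sysctls, that looks like the following (min/default/max per socket,
in bytes; the ~6MB max here is just the value mentioned above, not a
recommendation):

    sysctl -w net.ipv4.tcp_rmem="4096 87380 6291456"
    # net.ipv4.tcp_wmem takes the same min/default/max form for the send side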

On Mon, Oct 21, 2019 at 1:53 PM Sergio  wrote:

> Thanks Elliott!
>
> How do you know if there is too much RAM used for those settings?
>
> Which metrics do you keep track of?
>
> What would you recommend instead?
>
> Best,
>
> Sergio
>
> On Mon, Oct 21, 2019, 1:41 PM Elliott Sims  wrote:
>
>> Based on my experiences, if you have a new enough kernel I'd strongly
>> suggest switching the TCP scheduler algorithm to BBR.  I've found the rest
>> tend to be extremely sensitive to even small amounts of packet loss among
>> cluster members where BBR holds up well.
>>
>> High ulimits for basically everything are probably a good idea, although
>> "unlimited" may not be purely optimal for all cases.
>> The TCP keepalive settings are probably only necessary for traffic
>> buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
>> fast network.
>>
>> The TCP memory settings are pretty aggressive and probably result in
>> unnecessary RAM usage.
>> The net.core.rmem_default/net.core.wmem_default settings are overridden
>> by the TCP-specific settings as far as I know, so they're not really
>> relevant/helpful for Cassandra
>> The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
>> aggressive.  That works out to something like 1Gbps with 130ms latency per
>> TCP connection, but on a local LAN with latencies <1ms it's enough buffer
>> for over 100Gbps per TCP session.  A much smaller value will probably make
>> more sense for most setups.
>>
>>
>> On Mon, Oct 21, 2019 at 10:21 AM Sergio 
>> wrote:
>>
>>>
>>> Hello!
>>>
>>> This is the kernel that I am using
>>> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
>>> x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> Best,
>>>
>>> Sergio
>>>
>>> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
>>> rpinchb...@tripadvisor.com> ha scritto:
>>>
>>>> I don't know which distro and version you are using, but watch out for
>>>> surprises in what vm.swappiness=0 means.  In older kernels it means "only
>>>> use swap when desperate".  I believe that newer kernels changed to have 1
>>>> mean that, and 0 means to always use the oomkiller.  Neither situation is
>>>> strictly good or bad, what matters is what you intend the system behavior
>>>> to be in comparison with whatever monitoring/alerting you have put in 
>>>> place.
>>>>
>>>> R
>>>>
>>>>
>>>> On 10/18/19, 9:04 PM, "Sergio Bilello" 
>>>> wrote:
>>>>
>>>>  Message from External Sender
>>>>
>>>> Hello everyone!
>>>>
>>>>
>>>>
>>>> Do you have any setting that you would change or tweak from the
>>>> below list?
>>>>
>>>>
>>>>
>>>> sudo cat /proc/4379/limits
>>>>
>>>> Limit Soft Limit   Hard Limit
>>>>  Units
>>>>
>>>> Max cpu time  unlimitedunlimited
>>>> seconds
>>>>
>>>> Max file size unlimitedunlimited
>>>> bytes
>>>>
>>>> Max data size unlimitedunlimited
>>>> bytes
>>>>
>>>> Max stack sizeunlimitedunlimited
>>>> bytes
>>>>
>>>> Max core file sizeunlimitedunlimited
>>>> bytes
>>>>
>>>> Max resi

Re: Cassandra Recommended System Settings

2019-10-21 Thread Elliott Sims
Based on my experiences, if you have a new enough kernel I'd strongly
suggest switching the TCP congestion control algorithm to BBR.  I've found the rest
tend to be extremely sensitive to even small amounts of packet loss among
cluster members, where BBR holds up well.

High ulimits for basically everything are probably a good idea, although
"unlimited" may not be purely optimal for all cases.
The TCP keepalive settings are probably only necessary for traffic traversing
buggy/misconfigured firewalls, but they shouldn't really do any harm on a modern
fast network.

The TCP memory settings are pretty aggressive and probably result in
unnecessary RAM usage.
The net.core.rmem_default/net.core.wmem_default settings are overridden by
the TCP-specific settings as far as I know, so they're not really
relevant/helpful for Cassandra
The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
aggressive.  That works out to something like 1Gbps with 130ms latency per
TCP connection, but on a local LAN with latencies <1ms it's enough buffer
for over 100Gbps per TCP session.  A much smaller value will probably make
more sense for most setups.


On Mon, Oct 21, 2019 at 10:21 AM Sergio  wrote:

>
> Hello!
>
> This is the kernel that I am using
> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
> x86_64 x86_64 x86_64 GNU/Linux
>
> Best,
>
> Sergio
>
> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
> rpinchb...@tripadvisor.com> ha scritto:
>
>> I don't know which distro and version you are using, but watch out for
>> surprises in what vm.swappiness=0 means.  In older kernels it means "only
>> use swap when desperate".  I believe that newer kernels changed to have 1
>> mean that, and 0 means to always use the oomkiller.  Neither situation is
>> strictly good or bad, what matters is what you intend the system behavior
>> to be in comparison with whatever monitoring/alerting you have put in place.
>>
>> R
>>
>>
>> On 10/18/19, 9:04 PM, "Sergio Bilello" 
>> wrote:
>>
>>  Message from External Sender
>>
>> Hello everyone!
>>
>>
>>
>> Do you have any setting that you would change or tweak from the below
>> list?
>>
>>
>>
>> sudo cat /proc/4379/limits
>>
>> Limit                     Soft Limit           Hard Limit           Units
>> Max cpu time              unlimited            unlimited            seconds
>> Max file size             unlimited            unlimited            bytes
>> Max data size             unlimited            unlimited            bytes
>> Max stack size            unlimited            unlimited            bytes
>> Max core file size        unlimited            unlimited            bytes
>> Max resident set          unlimited            unlimited            bytes
>> Max processes             32768                32768                processes
>> Max open files            1048576              1048576              files
>> Max locked memory         unlimited            unlimited            bytes
>> Max address space         unlimited            unlimited            bytes
>> Max file locks            unlimited            unlimited            locks
>> Max pending signals       unlimited            unlimited            signals
>> Max msgqueue size         unlimited            unlimited            bytes
>> Max nice priority         0                    0
>> Max realtime priority     0                    0
>> Max realtime timeout      unlimited            unlimited            us
>>
>>
>>
>> These are the sysctl settings
>>
>> default['cassandra']['sysctl'] = {
>>
>> 'net.ipv4.tcp_keepalive_time' => 60,
>>
>> 'net.ipv4.tcp_keepalive_probes' => 3,
>>
>> 'net.ipv4.tcp_keepalive_intvl' => 10,
>>
>> 'net.core.rmem_max' => 16777216,
>>
>> 'net.core.wmem_max' => 16777216,
>>
>> 'net.core.rmem_default' => 16777216,
>>
>> 'net.core.wmem_default' => 16777216,
>>
>> 'net.core.optmem_max' => 40960,
>>
>> 'net.ipv4.tcp_rmem' => '4096 87380 16777216',
>>
>> 'net.ipv4.tcp_wmem' => '4096 65536 16777216',
>>
>> 'net.ipv4.ip_local_port_range' => '1 65535',
>>
>> 'net.ipv4.tcp_window_scaling' => 1,
>>
>>'net.core.netdev_max_backlog' => 2500,
>>
>>'net.core.somaxconn' => 65000,
>>
>> 'vm.max_map_count' => 1048575,
>>
>> 'vm.swappiness' => 0
>>
>> }
>>
>>
>>
>> Am I missing something else?
>>
>>
>>
>> Do you have any experience to configure CENTOS 7
>>
>> for
>>
>> JAVA HUGE PAGES
>>
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.datastax.com_en_dse_5.1_dse-2Dadmin_datastax-5Fenterprise_config_configRecommendedSettings.html-23CheckJavaHugepagessettings=DwIBaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=zke-WpkD1c6Qt1cz8mJG0ZQ37h8kezqknMSnerQhXuU=b6lGdbtv1SN9opBsIOFRT6IX6BroMW-8Tudk9qEh3bI=
>>
>>
>>
>> OPTIMIZE SSD
>>
>>
>> 

Re: snapshots and 'dot' prefixed _index directories

2019-10-01 Thread Elliott Sims
The tar error is because tar also looks for metadata changes.  In this
case, it's the refcount that's changing and causing the error.  I just
switched to using bsdtar instead as a workaround.
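
As an illustration (the path is borrowed from the snapshot example quoted below; adjust to your own layout), bsdtar is essentially a drop-in replacement:

# GNU tar exits non-zero on "file changed as we read it"; bsdtar just archives what it sees.
# On CentOS 7 the package is "bsdtar" (part of libarchive); the name may differ per distro.
bsdtar -czf /backup/tableau_notification-1569974640.tar.gz \
  -C /var/lib/cassandra/data/impactvizor \
  tableau_notification-04bfb600291e11e7aeab31f0f0e5804b/snapshots/1569974640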

On Tue, Oct 1, 2019, 5:37 PM James A. Robinson 
wrote:

> Hi folks,
>
>
> I took a nodetool snapshot of a keyspace in my cassandra 3.11 cluster
> and it included directories with a 'dot' prefix (often called a hidden
> file/directory).  As an example:
>
>
> /var/lib/cassandra/data/impactvizor/tableau_notification-04bfb600291e11e7aeab31f0f0e5804b/snapshots/1569974640/.tableau_notification_alert_id_index
>
> Am I supposed to back up the files under the dot-prefixed directories
> the same as I do the other files?
>
> I ask because tar just complained that one of these files 'changed as
> we read it' which I wouldn't have expected given the documentation of
> how snapshots worked
>
> Jim
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Multiple compactions to same disk with 3.11.4

2019-10-01 Thread Elliott Sims
There's a concurrent_compactors parameter in cassandra.yaml that does
exactly what the name says.  You may also find
compaction_throughput_mb_per_sec useful.
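
For example (the numbers are placeholders rather than tuned values), both live in cassandra.yaml, and the throughput cap can also be changed at runtime on each node:

# cassandra.yaml (the concurrent_compactors change needs a restart):
#   concurrent_compactors: 2
#   compaction_throughput_mb_per_sec: 16
# Or adjust the throughput cap live:
nodetool setcompactionthroughput 16
nodetool getcompactionthroughput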

On Tue, Oct 1, 2019 at 8:16 AM Matthias Pfau 
wrote:

> Hi there,
> we recently upgraded from 2.2 to 3.11.4.
>
> Unfortunately, we are running into problems with the compaction
> scheduling, now. From time to time, a bunch of compactions (e.g. 6) are
> scheduled for the same data dir. This makes no sense for spinning disks as
> it will slow down all compactions and other operations like flushes
> dramatically.
>
> Has someone else experienced this problem? If so, how did you workaround
> this? Do you know of an open issue regarding this?
>
> Thanks!
>
> Best,
> Matthias
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: SOLR Config on Dev Center

2019-09-25 Thread Elliott Sims
Datastax might be a better resource for this.  This mailing list is pretty
good about questions that apply to DSE and Apache Cassandra, but the SOLR
integration is pretty specific to DSE.

On Wed, Sep 25, 2019 at 7:15 PM kumar bharath 
wrote:

> Hi All,
>
> We are having a 6 node cluster with two data centers (DSE 5.1
> Cassandra). One of the data centers is SOLR enabled.
> Do I need to explicitly add the SOLR search node in Dev-Center to perform
> SOLR queries?
>
> Thanks for you response in advance.
>
> Regards,
> Bharath Kumar B
>


Re: Multiple C* instances on same machine

2019-09-20 Thread Elliott Sims
A container of some sort gives you better isolation and less risk of a
mistake that could cause the instances to conflict in some way.  Might be
better for balancing resources between them as well, though using cgroups
directly can also accomplish that.
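
As a rough sketch of the container route (image tag, paths, ports and resource limits below are arbitrary assumptions, not a vetted layout):

# Two isolated "clusters" on one box; each keeps the stock ports internally,
# only the published CQL port and the data volume differ.
docker run -d --name cluster-a --memory=8g --cpus=4 \
  -v /data/cluster-a:/var/lib/cassandra -p 9042:9042 cassandra:3.11
docker run -d --name cluster-b --memory=8g --cpus=4 \
  -v /data/cluster-b:/var/lib/cassandra -p 9043:9042 cassandra:3.11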

On Fri, Sep 20, 2019, 8:27 AM Nitan Kainth  wrote:

> Hi There,
>
> Any feedback pros/cons for having multiple instances of C* on the same
> machine without Docker/container solution?
>
> The plan is to change the ports and run multiple C* processes, so we can
> isolate two applications as two different clusters.
>


Re: How to delete huge partition in cassandra 3.0.13

2019-08-12 Thread Elliott Sims
It may also be worth upgrading to Cassandra 3.11.4.  There are some changes
in 3.6+ that significantly reduce heap pressure from very large partitions.
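
If it helps, a quick way to confirm how big the offending partitions actually are (keyspace/table names are placeholders; on 3.0.x the older "cf" command names still apply):

# Percentile/max partition sizes for one table:
nodetool cfhistograms my_keyspace my_table
# "Compacted partition maximum bytes" per table:
nodetool cfstats my_keyspace.my_table
# 3.x also warns in the log above compaction_large_partition_warning_threshold_mb:
grep -i "large partition" /var/log/cassandra/system.log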

On Mon, Aug 12, 2019 at 9:13 AM Gabriel Giussi 
wrote:

> I've found a huge partition (~9GB) in my cassandra cluster because I'm
> losing 3 nodes recurrently due to OutOfMemoryError
>
>> ERROR [SharedPool-Worker-12] 2019-08-12 11:07:45,735
>> JVMStabilityInspector.java:140 - JVM state determined to be unstable.
>> Exiting forcefully due to:
>> java.lang.OutOfMemoryError: Java heap space
>> at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) ~[na:1.8.0_151]
>> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[na:1.8.0_151]
>> at
>> org.apache.cassandra.io.util.DataOutputBuffer.reallocate(DataOutputBuffer.java:126)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.io.util.DataOutputBuffer.doFlush(DataOutputBuffer.java:86)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:132)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.write(BufferedDataOutputStreamPlus.java:151)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.utils.ByteBufferUtil.writeWithVIntLength(ByteBufferUtil.java:297)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.marshal.AbstractType.writeValue(AbstractType.java:373)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.BufferCell$Serializer.serialize(BufferCell.java:267)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:193)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:109)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:97)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:132)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:87)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:77)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:301)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:145)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:138)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:134)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:76)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:321)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:47)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> ~[na:1.8.0_151]
>> at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164)
>> ~[apache-cassandra-3.0.13.jar:3.0.13]
>> at
>> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:136)
>> [apache-cassandra-3.0.13.jar:3.0.13]
>> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
>> [apache-cassandra-3.0.13.jar:3.0.13]
>> at java.lang.Thread.run(Thread.java:748) [na:1.8.0_151]
>>
>
> From the stacktrace I assume that some client is try to read that
> partition (ReadResponse) so I may filter requests to this specific
> partition as a quick solution but I think the compaction will never be able
> to remove this partition (I already executed a DELETE).
> What can I do to delete this partition? May I delete the sstable directly?
> Or should I upgrade the node and give more heap to cassandra?
>
> Thanks.
>


Re: Optimal Heap Size Cassandra Configuration

2019-05-20 Thread Elliott Sims
It's not really something that can be easily calculated based on write
rate, but more something you have to find empirically and adjust
periodically.
Generally speaking, I'd start by running "nodetool gcstats" or similar and
just see what the GC pause stats look like.  If it's not pausing much or
for long, you're good.  If it is, you'll likely need to do some tuning
based on GC logging which may involve increasing the heap but could also
mean decreasing it or changing the collection strategy.

Generally speaking, with G1GC you can get away with just setting a larger
heap than you really need and it's close enough to optimal.  CMS is
theoretically more efficient, but far more complex to get tuned properly
and tends to fail more dramatically.
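
To make that first step concrete (the heap figures here are only examples, not a sizing recommendation for this cluster):

# Cumulative GC counts and pause times since the last time this was run:
nodetool gcstats
# Heap is set in cassandra-env.sh (or jvm.options on newer 3.x packages), e.g.:
#   MAX_HEAP_SIZE="16G"
#   HEAP_NEWSIZE="1600M"    # only used by CMS; ignored by G1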

On Mon, May 20, 2019 at 7:38 AM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Experts,
>
> I have a 5 node cluster with 8 core CPU and 32 GiB RAM
>
> If I have a write TPS of 5K/s and read TPS of 8K/s, I want to know what is
> the optimal heap size configuration for each cassandra node.
>
> Currently, the heap size is set at 8GB. How can I know if cassandra
> requires more or less heap memory?
>
> Akshay Bhardwaj
> +91-97111-33849
>


Re: Five Questions for Cassandra Users

2019-03-28 Thread Elliott Sims
1.   Do the same people where you work operate the cluster and write
the code to develop the application?

Mostly.  Ops vs dev, although there's some overlap

2.   Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?

 Yes, Prometheus+Grafana (currently custom script reporting to Prometheus,
but that needs revisiting)

3.   Do you have a log stack that allows you to see the logs for all
the nodes together?

 Yep, graylog.

4.   Do you regularly repair your clusters - such as by using Reaper?

 Yes, with reaper.  Every day or two, more or less.  It would be
almost-constant if Reaper could work off queues with blacklisted time
windows instead of a schedule

5.   Do you use artificial intelligence to help manage your clusters?

No.

On Thu, Mar 28, 2019 at 8:46 AM Tom van der Woerdt
 wrote:

> 1.   Do the same people where you work operate the cluster and write
> the code to develop the application?
>
> No, we have a small infrastructure team, and many people developing
> applications using Cassandra
>
> 2.   Do you have a metrics stack that allows you to see graphs of
> various metrics with all the nodes displayed together?
>
> Yes, we use a re-implementation of Graphite, which we open-sourced and now
> lives at https://github.com/go-graphite
>
> 3.   Do you have a log stack that allows you to see the logs for all
> the nodes together?
>
> Yes, although in practice we don't use it much for Cassandra
>
> 4.   Do you regularly repair your clusters - such as by using Reaper?
>
> Yes, we have built our own tools for this
>
> 5.   Do you use artificial intelligence to help manage your clusters?
>
> It's not "artificial intelligence" the way most people would describe it,
> but we certainly don't run our clusters manually
>
>
>
> Tom van der Woerdt
> Site Reliability Engineer
>
> Booking.com B.V.
> Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
> Empowering people to experience the world since 1996
> 43 languages, 214+ offices worldwide, 141,000+ global destinations, 29
> million reported listings
> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
>
>
> On Thu, Mar 28, 2019 at 10:03 AM Kenneth Brotman
>  wrote:
>
>> I’m looking to get a better feel for how people use Cassandra in
>> practice.  I thought others would benefit as well so may I ask you the
>> following five questions:
>>
>>
>>
>> 1.   Do the same people where you work operate the cluster and write
>> the code to develop the application?
>>
>>
>>
>> 2.   Do you have a metrics stack that allows you to see graphs of
>> various metrics with all the nodes displayed together?
>>
>>
>>
>> 3.   Do you have a log stack that allows you to see the logs for all
>> the nodes together?
>>
>>
>>
>> 4.   Do you regularly repair your clusters - such as by using Reaper?
>>
>>
>>
>> 5.   Do you use artificial intelligence to help manage your clusters?
>>
>>
>>
>>
>>
>> Thank you for taking your time to share this information!
>>
>>
>>
>> Kenneth Brotman
>>
>


Re: Garbage Collector

2019-03-19 Thread Elliott Sims
I use G1, and I think it's actually the default now for newer Cassandra
versions.  For G1, I've done very little custom config/tuning.  I increased
heap to 16GB (out of 64GB physical), but most of the rest is at or near
default.  For the most part, it's been "feed it more RAM, and it works"
compared to CMS's "lower overhead, works great until it doesn't" and dozens
of knobs.

I haven't tried ZGC yet, but anecdotally I've heard that it doesn't really
match or beat G1 quite yet.

On Tue, Mar 19, 2019 at 9:44 AM Ahmed Eljami  wrote:

> Hi Folks,
>
> Does someone use G1 GC or ZGC on production?
>
> Can you share your feedback, the configuration used if it's possible ?
>
> Thanks.
>
>


Re: Restore a table with dropped columns to a new cluster fails

2019-02-19 Thread Elliott Sims
When a snapshot is taken, it includes a "schema.cql" file.  That should be
sufficient to restore whatever you need to restore.  I'd argue that neither
automatically resurrecting a dropped table nor silently failing to restore
it is a good behavior, so it's not unreasonable to have the user re-create
the table then choose if they want to re-drop it.
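
As a rough illustration of using that file (paths, keyspace and table names are placeholders for whatever your snapshot actually contains):

# Every snapshot directory carries the table's schema as of snapshot time:
ls /var/lib/cassandra/data/my_ks/my_table-*/snapshots/my_tag/schema.cql
# Re-create the table on the new cluster before loading the data back:
cqlsh -f /path/to/schema.cql
sstableloader -d target_node_ip /path/to/restore/my_ks/my_table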


On Tue, Feb 19, 2019 at 7:28 AM Hannu Kröger  wrote:

> Hi,
>
> I would like to bring this issue to your attention.
>
> Link to the ticket:
> https://issues.apache.org/jira/browse/CASSANDRA-14336
>
> Basically if a table contains dropped columns and you try to restore a
> snapshot to a new cluster, that will fail because of an error like
> "java.lang.RuntimeException: Unknown column XXX during deserialization”.
>
> I feel this is quite serious problem for backup and restore functionality
> of Cassandra. You cannot restore a backup to a new cluster if columns have
> been dropped.
>
> There have been other similar tickets that have been apparently closed but
> based on my test with 3.11.4, the issue still persists.
>
> Best Regards,
> Hannu Kröger
>


Re: High GC pauses leading to client seeing impact

2019-02-11 Thread Elliott Sims
I would strongly suggest you consider an upgrade to 3.11.x.  I found it
decreased space needed by about 30% in addition to significantly lowering
GC.

As a first step, though, why not just revert to CMS for now if that was
working ok for you?  Then you can convert one host for diagnosis/tuning so
the cluster as a whole stays functional.

That's also a pretty old version of the JDK to be using G1.  I would
definitely upgrade that to 1.8u202 and see if the problem goes away.

On Sun, Feb 10, 2019, 10:22 PM Rajsekhar Mallick wrote:
> Hello Team,
>
> I have a cluster of 17 nodes in production.(8 and 9 nodes in 2 DC).
> Cassandra version: 2.0.11
> Client connecting using thrift over port 9160
> Jdk version : 1.8.066
> GC used : G1GC (16GB heap)
> Other GC settings:
> Maxgcpausemillis=200
> Parallels gc threads=32
> Concurrent gc threads= 10
> Initiatingheapoccupancypercent=50
> Number of cpu cores for each system : 40
> Memory size: 185 GB
> Read/sec : 300 /sec on each node
> Writes/sec : 300/sec on each node
> Compaction strategy used : Size tiered compaction strategy
>
> Identified issues in the cluster:
> 1. Disk space usage across all nodes in the cluster is 80%. We are
> currently working on adding more storage on each node
> 2. There are 2 tables for which we keep on seeing a large number of
> tombstones. One of the tables has read requests seeing 120 tombstone cells in
> the last 5 mins as compared to 4 live cells. Tombstone warnings and error messages
> of queries getting aborted are also seen.
>
> Current issue sen:
> 1. We keep on seeing GC pauses of few minutes randomly across nodes in the
> cluster. GC pauses of 120 seconds, even 770 seconds are also seen.
> 2. This leads to nodes getting stalled and client seeing direct impact
> 3. The GC pause we see, are not during any of G1GC phases. The GC log
> message prints “Time to stop threads took 770 seconds”. So it is not the
> garbage collector doing any work but stopping the threads at a safe point
> is taking so much of time.
> 4. This issue has surfaced recently after we changed 8GB(CMS) to
> 16GB(G1GC) across all nodes in the cluster.
>
> Kindly do help on the above issue. I am not able to exactly understand if
> the GC is wrongly tuned, or if this is something else.
>
> Thanks,
> Rajsekhar Mallick
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Data storage space unbalance issue

2018-12-03 Thread Elliott Sims
It depends on the type of repair, but you'll want to make sure all the data
is where it should be before running cleanup.  Somewhat related, if you're
not running regular repairs already, you should be.  You can do it via
cron, but I strongly suggest checking out Reaper.
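
A minimal sketch of that sequence (keyspace name is a placeholder; run it node by node, or let Reaper drive the repair part):

# 1. Make sure every replica in datacenter2 has the data it should:
nodetool repair -full my_keyspace
# 2. Then drop the ranges each node no longer owns:
nodetool cleanup my_keyspace
# 3. And confirm no old snapshots are still holding disk:
nodetool listsnapshots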

On Wed, Nov 28, 2018, 8:05 PM Eunsu Kim wrote:
> Thank you for your response.
>
> I will run repair from datacenter2 with your advice. Do I have to run
> repair on every node in datacenter2?
>
> There is no snapshot when checked with nodetool listsnapshots.
>
> Thank you.
>
> On 29 Nov 2018, at 4:31 AM, Elliott Sims  wrote:
>
> I think you answered your own question, sort of.
>
> When you expand a cluster, it copies the appropriate rows to the new
> node(s) but doesn't automatically remove them from the old nodes.  When you
> ran cleanup on datacenter1, it cleared out those old extra copies.  I would
> suggest running a repair first for safety on datacenter2, then a "nodetool
> cleanup" on those hosts.
>
> Also run "nodetool snapshot" to make sure you don't have any old snapshots
> sitting around taking up space.
>
> On Wed, Nov 28, 2018 at 5:29 AM Eunsu Kim  wrote:
>
>> (I am sending the previous mail again because it seems that it has not
>> been sent properly.)
>>
>> HI experts,
>>
>> I am running 2 datacenters each containing five nodes. (total 10 nodes,
>> all 3.11.3)
>>
>> My data is stored one at each data center. (REPLICATION = { 'class' :
>> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1':
>> '1', 'datacenter2': '1’ })
>>
>> Most of my data have a short TTL(14days). The gc_grace_seconds value for
>> all tables is also 600sec.
>>
>> I expect the two data centers to use the same size but datacenter2 is
>> using more size. It seems that the data of datacenter2 is rarely
>> deleted. While the disk usage for datacenter1 remains constant, the disk
>> usage for datacenter2 continues to grow.
>>
>> ——
>> Datacenter: datacenter1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address   Load   Tokens   Owns (effective)  Host ID
>> Rack
>> UN  10.61.58.228  925.48 GiB  256  21.5%
>> 60d1bac8-b4d6-4e02-a05f-badee0bb36f5  rack1
>> UN  10.61.58.167  840 GiB256  20.0%
>> a04fc77a-907f-490c-971c-4e1f964c7b14  rack1
>> UN  10.61.75.86   1.13 TiB   256  19.3%
>> 618c101b-036d-42e7-bf9f-2bcbd429cbd1  rack1
>> UN  10.61.59.22   844.19 GiB  256  20.0%
>> d8a4a165-13f0-4f4a-9278-4024730b8116  rack1
>> UN  10.61.59.82   737.88 GiB  256  19.2%
>> 054a4eb5-6d1c-46fa-b550-34da610da4e0  rack1
>> Datacenter: datacenter2
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address   Load   Tokens   Owns (effective)  Host ID
>> Rack
>> UN  10.42.6.120   1.11 TiB   256  18.6%
>> 69f15be0-e5a1-474e-87cf-b063e6854402  rack1
>> UN  10.42.5.207   1.17 TiB   256  20.0%
>> f78bdce5-cb01-47e0-90b9-fcc31568e49e  rack1
>> UN  10.42.6.471.01 TiB   256  20.1%
>> 3ff93b47-2c15-4e1a-a4ea-2596f26b4281  rack1
>> UN  10.42.6.481007.67 GiB  256  20.4%
>> 8cbbe76d-6496-403a-8b09-fe6812c9dea2  rack1
>> UN  10.42.5.208   1.29 TiB   256  20.9%
>> 4aa96c6a-6083-417f-a58a-ec847bcbfc7e  rack1
>> --
>>
>> A few days ago, one node of datacenter1 broke down and replaced it, and I
>> worked on rebuild, repair, and cleanup.
>>
>>
>> What else can I do?
>>
>> Thank you in advance.
>>
>
>


Re: Data storage space unbalance issue

2018-11-28 Thread Elliott Sims
I think you answered your own question, sort of.

When you expand a cluster, it copies the appropriate rows to the new
node(s) but doesn't automatically remove them from the old nodes.  When you
ran cleanup on datacenter1, it cleared out those old extra copies.  I would
suggest running a repair first for safety on datacenter2, then a "nodetool
cleanup" on those hosts.

Also run "nodetool snapshot" to make sure you don't have any old snapshots
sitting around taking up space.

On Wed, Nov 28, 2018 at 5:29 AM Eunsu Kim  wrote:

> (I am sending the previous mail again because it seems that it has not
> been sent properly.)
>
> HI experts,
>
> I am running 2 datacenters each containing five nodes. (total 10 nodes,
> all 3.11.3)
>
> My data is stored one at each data center. (REPLICATION = { 'class' :
> 'org.apache.cassandra.locator.NetworkTopologyStrategy', 'datacenter1': '1'
> , 'datacenter2': '1’ })
>
> Most of my data have a short TTL(14days). The gc_grace_seconds value for
> all tables is also 600sec.
>
> I expect the two data centers to use the same size but datacenter2 is
> using more size. It seems that the data of datacenter2 is rarely
> deleted. While the disk usage for datacenter1 remains constant, the disk
> usage for datacenter2 continues to grow.
>
> ——
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address   Load   Tokens   Owns (effective)  Host ID
> Rack
> UN  10.61.58.228  925.48 GiB  256  21.5%
> 60d1bac8-b4d6-4e02-a05f-badee0bb36f5  rack1
> UN  10.61.58.167  840 GiB256  20.0%
> a04fc77a-907f-490c-971c-4e1f964c7b14  rack1
> UN  10.61.75.86   1.13 TiB   256  19.3%
> 618c101b-036d-42e7-bf9f-2bcbd429cbd1  rack1
> UN  10.61.59.22   844.19 GiB  256  20.0%
> d8a4a165-13f0-4f4a-9278-4024730b8116  rack1
> UN  10.61.59.82   737.88 GiB  256  19.2%
> 054a4eb5-6d1c-46fa-b550-34da610da4e0  rack1
> Datacenter: datacenter2
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address   Load   Tokens   Owns (effective)  Host ID
> Rack
> UN  10.42.6.120   1.11 TiB   256  18.6%
> 69f15be0-e5a1-474e-87cf-b063e6854402  rack1
> UN  10.42.5.207   1.17 TiB   256  20.0%
> f78bdce5-cb01-47e0-90b9-fcc31568e49e  rack1
> UN  10.42.6.471.01 TiB   256  20.1%
> 3ff93b47-2c15-4e1a-a4ea-2596f26b4281  rack1
> UN  10.42.6.481007.67 GiB  256  20.4%
> 8cbbe76d-6496-403a-8b09-fe6812c9dea2  rack1
> UN  10.42.5.208   1.29 TiB   256  20.9%
> 4aa96c6a-6083-417f-a58a-ec847bcbfc7e  rack1
> --
>
> A few days ago, one node of datacenter1 broke down and replaced it, and I
> worked on rebuild, repair, and cleanup.
>
>
> What else can I do?
>
> Thank you in advance.
>


Re: How to set num tokens on live node

2018-11-01 Thread Elliott Sims
As far as I know, it's not possible to change it live.  You have to create
a new "datacenter" with new hosts using the new num_tokens value, then
switch everything to use the new DC and tear down the old.
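
Very roughly, that is the usual add-a-datacenter procedure (names and the num_tokens value below are placeholders, and this is a sketch rather than a full runbook):

# On each new node, before first start:
#   num_tokens: 16          # cassandra.yaml
#   dc=dc2                  # cassandra-rackdc.properties (GossipingPropertyFileSnitch)
# Extend replication to the new DC, then stream the existing data over:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"
nodetool rebuild -- dc1     # run on each new dc2 node
# Repoint clients at dc2, drop dc1 from replication, then decommission the old nodes.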

On Thu, Nov 1, 2018 at 6:16 PM Goutham reddy 
wrote:

> Hi team,
> Can someone help me out? I can't find anywhere how to change num_tokens
> on a running node. Any help is appreciated.
>
> Thanks and Regards,
> Goutham.
> --
> Regards
> Goutham Reddy
>


Re: Cassandra: Inconsistent data on reads (LOCAL_QUORUM)

2018-10-12 Thread Elliott Sims
I'll second that - we had some weird inconsistent reads for a long time
that we finally tracked to a small number of clients with significant clock
skew.  Make very sure all your client (not just C*) machines have
tightly-synced clocks.
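
A quick way to spot-check that across a fleet (host names are placeholders):

# Report NTP/chrony offsets on every client and Cassandra host:
for h in client1 client2 cass1 cass2 cass3; do
  echo "== $h"
  ssh "$h" 'ntpq -p 2>/dev/null || chronyc tracking'
done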

On Fri, Oct 12, 2018 at 7:40 PM maitrayee shah 
wrote:

> We have seen inconsistent read if the clock on the nodes are not in sync.
>
>
> Thank you
>
> Sent from my iPhone
>
> On Oct 12, 2018, at 1:50 PM, Naik, Ninad  wrote:
>
> Hello,
>
> We're seeing inconsistent data while doing reads on cassandra. Here are
> the details:
>
> It is a wide column table. The columns can be added by multiple
> machines, and read by multiple machines. The time between writes and reads
> are in minutes, but sometimes can be in seconds. Writes happen every 2
> minutes.
>
> Now, while reading we're seeing the following cases of inconsistent reads:
>
>- One column was added. If a read was done after the column was added
>(20 secs to 2 minutes after the write), Cassandra returns no data. As if
>the key doesn't exist. If the application retries, it gets the data.
>- A few columns exist for a row key. And a new column 'n' was added.
>Again, a read happens a few minutes after the write. This time, only the
>latest column 'n' is returned. In this case the app doesn't know that the
>data is incomplete so it doesn't retry. If we manually retry, we see all
>the columns.
>- A few columns exist for a row key. And a new column 'n' is added.
>When a read happens after the write, all columns but 'n' are returned.
>
> Here's what we've verified:
>
>- Both writes and reads are using 'LOCAL_QUORUM' consistency level.
>- The replication is within local data center. No remote data center
>is involved in the read or write.
>- During the inconsistent reads, none of the nodes are undergoing GC
>pauses
>- There are no errors in cassandra logs
>- Reads always happen after the writes.
>
> A few other details: Cassandra version: 2.1.9 DataStax java driver
> version: 2.1.10.2 Replication Factor: 3
>
> We don't see this problem in lower environments. We have seen this happen
> once or twice last year, but since last few days it's happening quite
> frequently. On an average 2 inconsistent reads every minute.
>
> Here's how the table definition looks like:
>
> CREATE TABLE "MY_TABLE" (
>   key text,
>   sub_key text,
>   value text,
>   PRIMARY KEY ((key), sub_key)
> ) WITH
>   bloom_filter_fp_chance=0.01 AND
>   caching='{"keys":"ALL", "rows_per_partition":"NONE"}' AND
>   comment='' AND
>   dclocal_read_repair_chance=0.10 AND
>   gc_grace_seconds=864000 AND
>   read_repair_chance=0.00 AND
>   default_time_to_live=0 AND
>   speculative_retry='ALWAYS' AND
>   memtable_flush_period_in_ms=0 AND
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
>   compression={'sstable_compression': 'LZ4Compressor'};
>
> Please point us in the right direction. Thanks !
>
>
>
> The information contained in this e-mail message and any attachments may
> be privileged and confidential. If the reader of this message is not the
> intended recipient or an agent responsible for delivering it to the
> intended recipient, you are hereby notified that any review, dissemination,
> distribution or copying of this communication is strictly prohibited. If
> you have received this communication in error, please notify the sender
> immediately by replying to this e-mail and delete the message and any
> attachments from your computer.
>
>


Re: High IO and poor read performance on 3.11.2 cassandra cluster

2018-09-11 Thread Elliott Sims
A few reasons I can think of offhand why your test setup might not see
problems from large readahead:
Your sstables are <4MB or your reads are typically <4MB from the end of the
file
Your queries tend to use the 4MB of data anyways
Your dataset is small enough that most of it fits in the VM cache, and it
rarely goes to disk
Load is low enough that the read I/O amplification doesn't hurt performance
Less likely but still possible is that there's a subtle difference in the
way that 2.1 does reads vs 3.x that's affecting it.  The less subtle
explanation is that 3.x has smaller rows and a smaller readahead is
therefore probably optimal, but that would only decrease your performance
benefit and not cause a regression from 2.1->3.x.
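
For reference, checking and changing readahead is just the following (device name is an example; the value is in 512-byte sectors, so 64 here means 32 KB):

# Current readahead per device (the RA column, in 512-byte sectors):
sudo blockdev --report
sudo blockdev --setra 64 /dev/sdb
# The setting is lost on reboot and Cassandra only samples it at startup,
# so persist it (rc.local, udev rule, etc.) and restart the node.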


On Mon, Sep 10, 2018 at 1:27 AM, Laxmikant Upadhyay  wrote:

> Thank you so much Alexander !
>
> Your doubt was right. It was due to the very high value of readahead only
> (4 mb).
>
> Although we had set the readahead value to 8kb in our /etc/rc.local, somehow
> this was not working.
> We are keeping the value at 64 kb as this is giving better performance
> than 8kb. Now we are able to meet our SLA.
>
> One interesting observation is that we have a setup on cassandra 2.1.16
> also and on that system the readahead value is 4mb only but we are not
> observing any performance dip there. I am not sure why.
>
>
> On Wed, Sep 5, 2018 at 11:31 AM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Don't forget to run "nodetool upgradesstables -a" after you ran the ALTER
>> statement so that all SSTables get re-written with the new compression
>> settings.
>>
>> Since you have a lot of tables in your cluster, be aware that lowering
>> the chunk length will grow the offheap memory usage of Cassandra.
>> You can get more informations here : http://thelastpickle.com/
>> blog/2018/08/08/compression_performance.html
>>
>> You should also check your readahead settings as it may be set too high :
>> sudo blockdev --report
>> The default is usually 256 but Cassandra would rather favor low readahead
>> values to get more IOPS instead of more throughput (and readahead is
>> usually not that useful for Cassandra). A conservative setting is 64 (you
>> can go down to 8 and see how Cassandra performs then).
>> Do note that changing the readahead settings requires to restart
>> Cassandra as it is only read once by the JVM during startup.
>>
>> Cheers,
>>
>> On Wed, Sep 5, 2018 at 7:27 AM CPC  wrote:
>>
>>> Could you decrease chunk_length_in_kb to 16 or 8 and repeat the test.
>>>
>>> On Wed, Sep 5, 2018, 5:51 AM wxn...@zjqunshuo.com 
>>> wrote:
>>>
 How large are your rows? You may be hitting the wide-row read problem.

 -Simon

 *From:* Laxmikant Upadhyay 
 *Date:* 2018-09-05 01:01
 *To:* user 
 *Subject:* High IO and poor read performance on 3.11.2 cassandra
 cluster

 We have 3 node cassandra cluster (3.11.2) in single dc.

 We have written 450 million records on the table with LCS. The write
 latency is fine.  After write we perform read and update operations.

 When we run read+update operations on newly inserted 1 million records
 (on top of 450 m records) then the read latency and io usage is under
 control. However when we perform read+update on old 1 million records which
 are part of 450 million records we observe high read latency (The
 performance goes down by 4 times in comparison to the 1st case).  We have not
 observed major gc pauses.

 *system information:*
 *cpu core :*  24
 *disc type : *ssd . we are using raid with deadline schedular
 *disk space:*
 df -h :
 Filesystem  Size  Used Avail Use% Mounted on
 /dev/sdb11.9T  393G  1.5T  22%
 /var/lib/cassandra
 *memory:*
 free -g
   totalusedfree  shared  buff/cache
  available
 Mem: 62  30   0   0  32
   31
 Swap: 8   0   8

 ==

 *schema*

 desc table ks.xyz;

 CREATE TABLE ks.xyz (
 key text,
 column1 text,
 value text,
 PRIMARY KEY (key, column1)
 ) WITH COMPACT STORAGE
 AND CLUSTERING ORDER BY (column1 ASC)
 AND bloom_filter_fp_chance = 0.1
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND comment = ''
 AND compaction = {'class': 'org.apache.cassandra.db.compaction.
 LeveledCompactionStrategy'}
 AND compression = {'chunk_length_in_kb': '64', 'class': '
 org.apache.cassandra.io.compress.LZ4Compressor'}
 AND crc_check_chance = 1.0
 AND dclocal_read_repair_chance = 0.0
 AND default_time_to_live = 0
 AND gc_grace_seconds = 864000
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND 

Re: Cluster CPU usage limit

2018-09-06 Thread Elliott Sims
It's interesting and a bit surprising that 256 write threads isn't enough.
Even with a lot of cores, I'd expect you to be able to saturate CPU with
that many threads.  I'd make sure you don't have other bottlenecks, like
GC, IOPs, network, or "microbursts" where your load is actually fluctuating
between 20-100% CPU.
Admittedly, I actually did get best results with 256 threads (and haven't
tested higher, but lower is definitely not enough), but every advice I've
seen is for a lower write thread count being optimal for most cases.
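
A few commands that help tell those apart (a sketch; adjust the intervals as needed):

# Non-zero Pending/Blocked on MutationStage or Native-Transport-Requests points at thread limits:
nodetool tpstats
# GC pause totals since the last call:
nodetool gcstats
# Disk utilization and iowait at 5-second resolution:
iostat -xz 5
# Per-core CPU at 1-second resolution shows microbursts that a 1-minute average hides:
mpstat -P ALL 1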

On Thu, Sep 6, 2018 at 5:51 AM, onmstester onmstester 
wrote:

> IMHO, Cassandra writes are more of a CPU-bound task, so when determining
> cluster write throughput, what CPU usage percentage (averaged across all cluster
> nodes) should be treated as the limit?
> Rephrased: what's the normal CPU usage in a Cassandra cluster (while no
> compaction, streaming or heavy reads are running)?
> For a cluster with 10 nodes, I got 700K writes per second for my data
> model, and average CPU load is about 40%. I'm going to increase the number of
> native threads (now 256) and the native queue (1024) to increase throughput
> (and CPU usage subsequently).
>
> Sent using Zoho Mail 
>
>
>


Re: benefits oh HBase over Cassandra

2018-08-24 Thread Elliott Sims
At the time that Facebook chose HBase, Cassandra was drastically less
mature than it is now and I think the original creators had already left.
There were already various Hadoop variants running for data analytics etc,
so lots of operational and engineering experience around it available.  So,
probably not a useful example to use in a technical comparison between
current HBase and current Cassandra.  Also, FB has since abandoned HBase
for messenger in favor of MyRocks.

On Fri, Aug 24, 2018 at 5:43 PM, dinesh.jo...@yahoo.com.INVALID <
dinesh.jo...@yahoo.com.invalid> wrote:

> I've worked with both databases. They're suitable for different use-cases.
> If you look at the CAP theorem; HBase is CP while Cassandra is a AP. If we
> talk about a specific use-case, it'll be easier to discuss.
>
> Dinesh
>
>
> On Friday, August 24, 2018, 1:56:31 PM PDT, Vitaliy Semochkin <
> vitaliy...@gmail.com> wrote:
>
>
> Hi,
>
> I read that Facebook once chose HBase over Cassandra for its messenger,
> but I never found what the benefits of HBase over Cassandra are;
> can someone list them, if there are any?
>
> Regards,
> Vitaliy
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Huge daily outbound network traffic

2018-08-16 Thread Elliott Sims
Since this is cross-node traffic, "nodetool netstats" during the
high-traffic period should give you a better idea of what's being sent.
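
For example, capturing it on a schedule around the suspect window (durations and paths are arbitrary):

# Sample streaming/gossip activity once a minute for two hours:
for i in $(seq 1 120); do date; nodetool netstats; sleep 60; done >> /tmp/netstats.log
# Or watch live which peer the port-7000 traffic is going to:
sudo iftop -i eth0 -f "port 7000"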

On Thu, Aug 16, 2018 at 2:34 AM, Behnam B.Marandi <
behnam.b.mara...@gmail.com> wrote:

> In the case of cron jobs, there are no jobs for that time period, and I can see
> the effect of jobs like backups and repairs, but the traffic they cause is not
> comparable - more like 800MB compared to 2GB. And in this case it is all
> outbound network traffic on all 3 cluster nodes.
>
> On Thu, Aug 16, 2018 at 5:16 PM dinesh.jo...@yahoo.com.INVALID <
> dinesh.jo...@yahoo.com.invalid> wrote:
>
>> Since it is predictable, can you check the logs during that period? What
>> do they say? Do you have a cron running on those hosts? Do all the nodes
>> experience this issue?
>>
>> Dinesh
>>
>>
>> On Thursday, August 16, 2018, 12:02:55 AM PDT, Behnam B.Marandi <
>> behnam.b.mara...@gmail.com> wrote:
>>
>>
>> Actually I did. It seems this is a cross node traffic from one node to
>> port 7000 (storage_port) of the other node.
>>
>> On Sun, Aug 12, 2018 at 2:44 PM Elliott Sims 
>> wrote:
>>
>> Since it's at a consistent time, maybe just look at it with iftop to see
>> where the traffic's going and what port it's coming from?
>>
>> On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi <
>> behnam.b.mara...@gmail.com> wrote:
>>
>> I don't have any external process or planned repair in that time period.
>> In case of network, I can see outbound network on Cassandra node network
>> interface but couldn't find any way to check the VPC network to make sure
>> it is not going out of network. Maybe the only way is analysing VPC Flow
>> Log.
>> B.
>>
>> On Tue, Aug 7, 2018 at 11:23 PM, Rahul Singh <
>> rahul.xavier.si...@gmail.com> wrote:
>>
>> Are you sure you don’t have an outside process that is doing an export ,
>> Spark job, non AWS managed backup process ?
>>
>> Is this network out from Cassandra or from the network?
>>
>>
>> Rahul
>> On Aug 7, 2018, 4:09 AM -0400, Behnam B.Marandi , wrote:
>>
>> Hi,
>> I have a 3 node Cassandra cluster (version 3.11.1) on m4.xlarge EC2
>> instances with separate EBS volumes for root (gp2), data (gp2) and
>> commitlog (io1).
>> I get daily outbound traffic at a certain time everyday. As you can see
>> in the attached screenshot, while my normal network load hardly meets
>> 200MB, this outbound (orange) spikes up to 2GB while inbound (purple) is
>> less than 800MB.
>> There is no repair or backup process going on in that time window, so I
>> am wondering where to look. Any idea?
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>>
>>


Re: Improve data load performance

2018-08-15 Thread Elliott Sims
For write threads, check "nodetool tpstats"

Are you loading the data serially?  That is, one query at a time?  If so
(and if you have no clear resource bottlenecks) you're probably going to
want to add some concurrency into the process.  Break the data up into
smaller chunks and have several threads inserting at once.
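
One CLI-level way to get that concurrency without touching application code (just an illustration with placeholder names, not necessarily the right tool for this job) is cqlsh's parallel COPY:

cqlsh -e "COPY my_ks.my_table (id, payload) FROM '/tmp/load.csv'
          WITH NUMPROCESSES=8 AND CHUNKSIZE=1000 AND MAXBATCHSIZE=20;"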

On Wed, Aug 15, 2018 at 1:35 PM, Abdul Patel  wrote:

> User in dev env with 4 node cluster , 50k records with inserts of 70k
> characters (json in text)
> This will happen daily in some intervals not yet defined on a single table.
> Its within 1 data center
>
>
> On Wednesday, August 15, 2018, Durity, Sean R 
> wrote:
>
>> Might also help to know:
>>
>> Size of cluster
>>
>> How much data is being loaded (# of inserts/actual data size)
>>
>> Single table or multiple tables?
>>
>> Is this a one-time or occasional load or more frequently?
>>
>> Is the data located in the same physical data center as the cluster? (any
>> network latency?)
>>
>>
>>
>> On the client side, prepared statements and ExecuteAsync can really speed
>> things up.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Elliott Sims 
>> *Sent:* Wednesday, August 15, 2018 1:13 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Improve data load performance
>>
>>
>>
>> Step one is always to measure your bottlenecks.  Are you spending a lot
>> of time compacting?  Garbage collecting?  Are you saturating CPU?  Or just
>> a few cores?  Or I/O?  Are repairs using all your I/O?  Are you just
>> running out of write threads?
>>
>>
>>
>> On Wed, Aug 15, 2018 at 5:48 AM, Abdul Patel  wrote:
>>
>> The application team is trying to load data with leveled compaction and it's
>> taking 1hr to load. What are the best options to load data faster?
>>
>>
>>
>> On Tuesday, August 14, 2018, @Nandan@ 
>> wrote:
>>
>> Bro, Please explain your question as much as possible.
>> This is not a single line Q session where we will be able to understand
>> your in-depth queries in a single line.
>> For better and suitable reply, Please ask a question and elaborate what
>> steps you took for your question and what issue are you getting and all..
>>
>> I hope I am making it clear. Don't take it personally.
>>
>>
>>
>> Thanks
>>
>>
>>
>> On Wed, Aug 15, 2018 at 8:25 AM Abdul Patel  wrote:
>>
>> How can we improve data load performance?
>>
>>
>>
>> --
>>
>> The information in this Internet Email is confidential and may be legally
>> privileged. It is intended solely for the addressee. Access to this Email
>> by anyone else is unauthorized. If you are not the intended recipient, any
>> disclosure, copying, distribution or any action taken or omitted to be
>> taken in reliance on it, is prohibited and may be unlawful. When addressed
>> to our clients any opinions or advice contained in this Email are subject
>> to the terms and conditions expressed in any applicable governing The Home
>> Depot terms of business or client engagement letter. The Home Depot
>> disclaims all responsibility and liability for the accuracy and content of
>> this attachment and for any damages or losses arising from any
>> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
>> items of a destructive nature, which may be contained in this attachment
>> and shall not be liable for direct, indirect, consequential or special
>> damages in connection with this e-mail message or its attachment.
>>
>


Re: "minimum backup" in vnodes

2018-08-15 Thread Elliott Sims
Assuming this isn't an existing cluster, the easiest method is probably to
use logical "racks" to explicitly control which hosts have a full replica
of the data.  with RF3 and 3 "racks", each "rack" has one complete replica.
If you're not using the logical racks, I think the replicas are spread
randomly and you can't generally reduce your backup count safely even with
a lot of work.
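
A sketch of that layout (DC/rack names are placeholders, and it assumes GossipingPropertyFileSnitch on a cluster built this way from the start):

# cassandra-rackdc.properties, two nodes per rack:
#   dc=dc1
#   rack=rack1      # rack2 / rack3 on the other pairs
cqlsh -e "CREATE KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'dc1': 3};"
# With RF=3 spread across 3 racks, each rack holds one full replica,
# so snapshotting just the two rack1 nodes captures every partition once:
nodetool snapshot -t nightly my_ks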

On Wed, Aug 15, 2018 at 1:39 PM, Carl Mueller <
carl.muel...@smartthings.com.invalid> wrote:

> Goal: backup a cluster with the minimum amount of data. Restore to be done
> with sstableloader
>
> Let's start with a basic case:
> - six node cluster
> - one datacenter
> - RF3
> - data is perfectly replicated/repaired
> - Manual tokens (no vnodes)
> - simplest strategy
>
> In this case, it is (theoretically) possible to get an perfect backup of
> data by storing the snapshots of two of the six nodes in the cluster due to
> replication factor.
>
> I once tried to parse the ring output with vnodes (256) and came to the
> conclusion that it was not possible with vnodes, maybe you could avoid one
> or two nodes of the six... tops. But I may have had an incorrect
> understanding of how ranges are replicated in vnodes.
>
> Would it be possible to pick only two nodes out of a six node cluster with
> vnodes and RF-3 that will backup the cluster?
>
>


Re: Improve data load performance

2018-08-15 Thread Elliott Sims
Step one is always to measure your bottlenecks.  Are you spending a lot of
time compacting?  Garbage collecting?  Are you saturating CPU?  Or just a
few cores?  Or I/O?  Are repairs using all your I/O?  Are you just running
out of write threads?

On Wed, Aug 15, 2018 at 5:48 AM, Abdul Patel  wrote:

> The application team is trying to load data with leveled compaction and it's
> taking 1hr to load. What are the best options to load data faster?
>
>
> On Tuesday, August 14, 2018, @Nandan@ 
> wrote:
>
>> Bro, Please explain your question as much as possible.
>> This is not a single line Q session where we will be able to understand
>> your in-depth queries in a single line.
>> For better and suitable reply, Please ask a question and elaborate what
>> steps you took for your question and what issue are you getting and all..
>>
>> I hope I am making it clear. Don't take it personally.
>>
>> Thanks
>>
>> On Wed, Aug 15, 2018 at 8:25 AM Abdul Patel  wrote:
>>
>>> How can we improve data load performance?
>>
>>


Re: upgrade 2.1 to 3.0

2018-08-11 Thread Elliott Sims
Might be a silly question, but did you run "nodetool upgradesstables" and
convert to the 3.0 format?  Also, which 3.0?  Newest, or an earlier 3.0.x?
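
For reference (data path is the default, and the format prefixes are from memory, so treat them as approximate): 2.1-era sstables carry a "ka" generation prefix, 3.0-era ones "ma"/"mb".

find /var/lib/cassandra/data -name '*-Data.db' | head
# Rewrite anything still on the old format (add -a to force-rewrite everything):
nodetool upgradesstables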

On Fri, Aug 10, 2018 at 3:05 PM, kooljava2 
wrote:

> Hello,
>
> We recently upgraded C* from 2.1 to 3.0. After the upgrade we are seeing an
> increase in the total read bytes and read ops on the EBS volumes. It
> almost doubled on all the nodes.  The number of writes is the same.
>
>
> Thank you.
>


Re: Huge daily outbound network traffic

2018-08-11 Thread Elliott Sims
Since it's at a consistent time, maybe just look at it with iftop to see
where the traffic's going and what port it's coming from?

On Fri, Aug 10, 2018 at 1:48 AM, Behnam B.Marandi <
behnam.b.mara...@gmail.com> wrote:

> I don't have any external process or planned repair in that time period.
> In case of network, I can see outbound network on Cassandra node network
> interface but couldn't find any way to check the VPC network to make sure
> it is not going out of network. Maybe the only way is analysing VPC Flow
> Log.
> B.
>
> On Tue, Aug 7, 2018 at 11:23 PM, Rahul Singh wrote:
>
>> Are you sure you don’t have an outside process that is doing an export ,
>> Spark job, non AWS managed backup process ?
>>
>> Is this network out from Cassandra or from the network?
>>
>>
>> Rahul
>> On Aug 7, 2018, 4:09 AM -0400, Behnam B.Marandi , wrote:
>>
>> Hi,
>> I have a 3 node Cassandra cluster (version 3.11.1) on m4.xlarge EC2
>> instances with separate EBS volumes for root (gp2), data (gp2) and
>> commitlog (io1).
>> I get daily outbound traffic at a certain time everyday. As you can see
>> in the attached screenshot, while my normal network load hardly meets
>> 200MB, this outbound (orange) spikes up to 2GB while inbound (purple) is
>> less than 800MB.
>> There is no repair or backup process going on in that time window, so I
>> am wondering where to look. Any idea?
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: about cassandra..

2018-08-09 Thread Elliott Sims
Deflate instead of LZ4 will probably give you somewhat better compression
at the cost of a lot of CPU.  Larger chunk length might also help, but in
most cases you probably won't see much benefit above 64K (and it will
increase I/O load).
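
For instance (keyspace/table are placeholders), switching an existing table and rewriting it:

cqlsh -e "ALTER TABLE my_ks.my_table WITH compression =
  {'class': 'DeflateCompressor', 'chunk_length_in_kb': '64'};"
# Existing sstables keep the old codec until rewritten:
nodetool upgradesstables -a my_ks my_table
# Compare the achieved ratio afterwards:
nodetool tablestats my_ks.my_table | grep -i compression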

On Wed, Aug 8, 2018 at 11:18 PM, Eunsu Kim  wrote:

> Hi all.
>
> I’m worried about the amount of disk I use, so I’m more curious about
> compression. We are currently using 3.11.0 and use default LZ4 Compressor
> ('chunk_length_in_kb': 64).
> Is there a setting that can make more powerful compression?
> Because most of them are time series data with TTL, we use
> TimeWindowCompactionStrategy.
>
> Thank you in advance.
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Too many Cassandra threads waiting!!!

2018-08-01 Thread Elliott Sims
You might have more luck trying to analyze at the Java level, either via a
(Java) stack dump and the "ttop" tool from Swiss Java Knife, or Cassandra
tools like "nodetool tpstats"

On Wed, Aug 1, 2018 at 2:08 AM, nokia ceph  wrote:

> Hi,
>
> i'm having a 5 node cluster with cassandra 3.0.13.
>
> i could see the cassandra process has too many threads.
>
> *# pstree -p `pgrep java` | wc -l*
> *453*
>
> And almost all of those threads are in *sleeping* state and wait at
> *# cat  /proc/166022/task/1698913/wchan*
> *futex_wait_queue_me*
>
> Some more info:
> *# strace -e trace=all -p 166022*
> *strace: Process 166022 attached*
> *futex(0x7efc24aeb9d0, FUTEX_WAIT, 166023, NULL*
>
> # cat /proc/166022/stack
> [] futex_wait_queue_me+0xc6/0x130
> [] futex_wait+0x17b/0x280
> [] do_futex+0x106/0x5a0
> [] SyS_futex+0x80/0x180
> [] system_call_fastpath+0x16/0x1b
> [] 0x
>
>
> What is the reason Cassandra has this many threads? Is it the
> normal behavior of Cassandra?  Is there a way to reduce this thread count?
> Will there be any performance impact because of this (our platform experts
> suspect so)?
>
> Regards,
> Renoy  Paulose
>
>


Re: Network throughput requirements

2018-07-10 Thread Elliott Sims
Among the hosts in a cluster?  It depends on how much data you're trying to
read and write.  In general, you're going to want a lot more bandwidth
among hosts in the cluster than you have external-facing.  Otherwise things
like repairs and bootstrapping new nodes can get slow/difficult.  To put it
in perspective, by default it's configured to use up to 200Mbps output
streaming traffic per source node (which might mean a multiple of that
incoming to one node in some cases).
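
Those streaming caps are visible and tunable per node, e.g.:

# Current outbound streaming cap in Mbps (200 is the default):
nodetool getstreamthroughput
# Change it at runtime, or set stream_throughput_outbound_megabits_per_sec in
# cassandra.yaml; inter-DC streaming has its own knob:
nodetool setstreamthroughput 400
nodetool setinterdcstreamthroughput 200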

What specifically are you trying to size?  If it's NICs on the hosts, 1Gbps
will be OK for low load but a bit of a bottleneck for higher-traffic
clusters.  10Gbps will probably be more than Cassandra can saturate even
with some tuning.  Or are you trying to size an overall LAN?  Same general
idea, but be aware that the traffic sort of comes in "waves" with repairs
and bootstrapping.  Or are you planning on having geographically spread
nodes within a cluster and want to know how big of a WAN link you need?
Putting those in separate logical "datacenters" with multiple replicas per
DC will give you more options in terms of limiting inter-DC traffic.

On Tue, Jul 10, 2018 at 11:14 AM, Justin Sanciangco <
jsancian...@blizzard.com.invalid> wrote:

> Hello,
>
>
>
> What is the general network throughput (mb/s) requirement for Cassandra?
>
>
>
> Thanks in advance for your advise,
>
>
>
> Justin
>


Re: JVM Heap erratic

2018-06-28 Thread Elliott Sims
Odd.  Your "post-GC" heap level seems a lot lower than your max, which
implies that you should be OK with ~10GB.  I'm guessing either you're
genuinely getting a huge surge in needed heap and running out, or it's
falling behind and garbage is building up.  If the latter, there might be
some tweaking you can do.  Probably worth turning on GC logging and digging
through exactly what's happening.
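
If GC logging isn't already on, the usual JVM 8 flags can be appended in cassandra-env.sh (log path is an example; jvm.options takes the bare flags instead):

JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M"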

CMS is kind of hard to tune and can have problems with heap fragmentation
since it doesn't compact, but if it's working for you I'd say stick with it.

On Thu, Jun 28, 2018 at 3:14 PM, Randy Lynn  wrote:

> Thanks for the feedback..
>
> Getting tons of OOM lately..
>
> You mentioned overprovisioned heap size... well...
> tried 8GB = OOM
> tried 12GB = OOM
> tried 20GB w/ G1 = OOM (and long GC pauses usually over 2 secs)
> tried 20GB w/ CMS = running
>
> we're java 8 update 151.
> 3.11.1.
>
> We've got one table that's got a 400MB partition.. that's the max.. the
> 99th is < 100MB, and 95th < 30MB..
> So I'm not sure that I'm overprovisioned, I'm just not quite yet to the
> heap size based on our partition sizes.
> All queries use cluster key, so I'm not accidentally reading a whole
> partition.
> The last place I'm looking - which maybe should be the first - is
> tombstones.
>
> sorry for the afternoon rant! thanks for your eyes!
>
> On Thu, Jun 28, 2018 at 5:54 PM, Elliott Sims 
> wrote:
>
>> It depends a bit on which collector you're using, but fairly normal.
>> Heap grows for a while, then the JVM decides via a variety of metrics that
>> it's time to run a collection.  G1GC is usually a bit steadier and less
>> sawtooth than the Parallel Mark Sweep , but if your heap's a lot bigger
>> than needed I could see it producing that pattern.
>>
>> On Thu, Jun 28, 2018 at 9:23 AM, Randy Lynn  wrote:
>>
>>> I have datadog monitoring JVM heap.
>>>
>>> Running 3.11.1.
>>> 20GB heap
>>> G1 for GC.. all the G1GC settings are out-of-the-box
>>>
>>> Does this look normal?
>>>
>>> https://drive.google.com/file/d/1hLMbG53DWv5zNKSY88BmI3Wd0ic
>>> _KQ07/view?usp=sharing
>>>
>>> I'm a C# .NET guy, so I have no idea if this is normal Java behavior.
>>>
>>>
>>>
>>> --
>>> Randy Lynn
>>> rl...@getavail.com
>>>
>>> office:
>>> 859.963.1616 <+1-859-963-1616> ext 202
>>> 163 East Main Street - Lexington, KY 40507 - USA
>>> <https://maps.google.com/?q=163+East+Main+Street+-+Lexington,+KY+40507+-+USA=gmail=g>
>>>
>>> <https://www.getavail.com/> getavail.com <https://www.getavail.com/>
>>>
>>
>>
>
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 <+1-859-963-1616> ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
> <https://maps.google.com/?q=163+East+Main+Street+-+Lexington,+KY+40507+-+USA=gmail=g>
>
> <https://www.getavail.com/> getavail.com <https://www.getavail.com/>
>


Re: JVM Heap erratic

2018-06-28 Thread Elliott Sims
It depends a bit on which collector you're using, but fairly normal.  Heap
grows for a while, then the JVM decides via a variety of metrics that it's
time to run a collection.  G1GC is usually a bit steadier and less sawtooth
than the Parallel Mark Sweep , but if your heap's a lot bigger than needed
I could see it producing that pattern.

On Thu, Jun 28, 2018 at 9:23 AM, Randy Lynn  wrote:

> I have datadog monitoring JVM heap.
>
> Running 3.11.1.
> 20GB heap
> G1 for GC.. all the G1GC settings are out-of-the-box
>
> Does this look normal?
>
> https://drive.google.com/file/d/1hLMbG53DWv5zNKSY88BmI3Wd0ic_
> KQ07/view?usp=sharing
>
> I'm a C# .NET guy, so I have no idea if this is normal Java behavior.
>
>
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 <+1-859-963-1616> ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
> 
>
>  getavail.com 
>


Re: High load, low IO wait, moderate CPU usage

2018-06-15 Thread Elliott Sims
Do you have an actual performance issue anywhere at the application level?
If not, I wouldn't spend too much time on it - load avg is a sort of odd
indirect metric that may or may not mean anything depending on the
situation.

On Fri, Jun 15, 2018 at 6:49 AM, Igor Leão  wrote:

> Hi there,
>
> I have a Cassandra cluster running on Kubernetes. This cluster has 8
> running instances with 8Gb of memory and 5 CPU cores. I can see a high
> load avg in multiple instances, but no IO wait and moderate CPU usage.
>
> Do you know how I can solve this issue?
>
> Best,
> Igor
>


Re: saving distinct data in cassandra result in many tombstones

2018-06-12 Thread Elliott Sims
If this is data that expires after a certain amount of time, you probably
want to look into using TWCS and TTLs to minimize the number of tombstones.

Decreasing gc_grace_seconds then compacting will reduce the number of
tombstones, but at the cost of potentially resurrecting deleted data if the
table hasn't been repaired during the grace interval.  You can also just
increase the tombstone thresholds, but the queries will be pretty
expensive/wasteful.
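
A sketch of the TWCS + TTL route for the hourly table in question (window size, TTL and names are illustrative assumptions):

# Expire rows after 7 days and compact in 6-hour windows, so whole sstables
# can be dropped once everything in them has expired:
cqlsh -e "ALTER TABLE my_ks.distinct_by_hour
  WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': '6'}
  AND default_time_to_live = 604800;"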

On Tue, Jun 12, 2018 at 2:02 AM, onmstester onmstester 
wrote:

> Hi,
>
> I needed to save a distinct value for a key in each hour, the problem with
> saving everything and computing distincts in memory is that there
> is too much repeated data.
> Table schema:
> Table distinct(
> hourNumber int,
> key text,
> distinctValue long
> primary key (hourNumber)
> )
>
> I want to retrieve distinct count of all keys in a specific hour and using
> this data model it would be achieved by reading a single partition.
> The problem: I can't read from this table; system.log indicates that more
> than 100K tombstones were read and no live data in it. The gc_grace time is
> the default (10 days), so I thought of decreasing it to 1 hour and running
> compaction, but is this the right approach at all? I mean, the whole idea of
> replacing some millions of rows, each 10 times in a partition, again and again,
> creates a lot of tombstones just to achieve distinct behavior?
>
> Thanks in advance
>
> Sent using Zoho Mail 
>
>
>


Re: Restoring snapshot

2018-06-11 Thread Elliott Sims
It's possible that it's something more subtle, but keep in mind that
sstables don't include the schema.  If you've made schema changes, you need
to apply/revert those first or C* probably doesn't know what to do with
those columns in the sstable.

On Sun, Jun 10, 2018 at 11:38 PM,  wrote:

> Dear Community,
>
>
>
> I’ll appreciate if I can get some responses to the observation below:
>
>
>
> https://stackoverflow.com/q/50763067/5701173
>
>
>
> Thanks and regards,
>
> Vishal Sharma
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>


Re: 3.11.2 memory leak

2018-06-04 Thread Elliott Sims
Are you seeing significant issues in terms of performance?  Increased
garbage collection, long pauses, or even OutOfMemory?  Which garbage
collector are you using and with what settings/thresholds?  Since the JVM's
garbage-collected, a bigger heap can mean a problem or it can just mean
"hasn't gotten big enough for the collector to bother doing any work"

If it's genuinely having memory/heap pressure problems, it's probably worth
getting a heap dump and poking through it to see what's using the space.
For a heap that big, you'll probably need to run the Eclipse MAT CLI tools
against it then open the result in the GUI.
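
Roughly (the MAT install path is an assumption, and the parse step needs plenty of RAM of its own):

# Dump only live objects from the running node (this triggers a full GC):
jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$(pgrep -f CassandraDaemon)"
# Generate the "leak suspects" report headlessly with Eclipse MAT:
/opt/mat/ParseHeapDump.sh /tmp/cassandra-heap.hprof org.eclipse.mat.api:suspects
# Then open the resulting report in the MAT GUI on a workstation.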

On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel  wrote:

> Hi All,
>
> I recently upgraded my non-prod cluster from 3.10 to 3.11.2.
> It was working fine for about 1.5 weeks, then suddenly nodetool info started
> reporting 80% and higher memory consumption.
> Initially it was configured with 16gb, then I bumped it to 20gb and rebooted
> all 4 nodes of the cluster (single DC).
> Now after 8 days I again see 80%+ usage, and it's at 16gb and above, which we
> never saw before.
> Seems like a memory leak bug?
> Does anyone have any idea? Our 3.11.2 release rollout has been halted
> because of this.
> If not 3.11.2, what's the next best stable release we have now?
>


Re: Mongo DB vs Cassandra

2018-06-01 Thread Elliott Sims
I'd say for a large write-heavy workload like this, Cassandra is a pretty clear
winner over MongoDB.  I agree with the commenters about understanding your
query patterns a bit better before choosing, though.  Cassandra's queries
are a bit limited, and if you're loading all new data every day and
discarding the old you might run into some significant tombstone issues.

It's worth looking into various other storage systems depending on your
exact needs, like S3, B2 (OK, I'm biased there), or possibly Spark or
Hadoop.  Cassandra's phenomenal at scaling to large write workloads, but
the data and query model isn't well-suited to all applications. It can also
be a bit... administration-intensive, though the same can be said about
MongoDB and Hadoop.

On Thu, May 31, 2018 at 11:17 AM, Joseph Arriola 
wrote:

> Based on the metrics you describe, I think the big data architecture could be
> Cassandra with Spark. You mention high availability; the APIs could use
> Node.js. This combination is powerful; the challenge is in the data model.
>
> On the other hand, if you are willing to sacrifice high availability and
> accept slower response times, MongoDB can be easier to implement.
>
>
>
> On Thu, May 31, 2018 at 10:01 AM, Sudhakar Ganesan <
> sudhakar.gane...@flex.com.invalid> wrote:
>
>> At a high level: in the production line, machines will provide data in the
>> form of CSV files every 1 second to 1 minute to 1 day (depending on the
>> machine type used in the line operations). I need to parse those files, load
>> them into the DB, and build an API layer to expose the data to downstream systems.
>>
>>
>>
>> *Number of files to be processed   13,889,660,134  per day*
>>
>> *Each file could range from 20 KB to 600MB which will translate into few
>> hundred rows to millions of rows.*
>>
>> *High availability with high write. Read is less compare to write.*
>>
>> *While extracting the rows, few validation to be performed.*
>>
>> *Build an API layer on top of the data to be persisted in the DB.*
>>
>>
>>
>> Now, tell me what would be the best choice…
>>
>>
>>
>> *From:* Russell Bateman [mailto:r...@windofkeltia.com]
>> *Sent:* Thursday, May 31, 2018 7:36 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Mongo DB vs Cassandra
>>
>>
>>
>> Sudhakar,
>>
>> MongoDB will accommodate loading CSV without regard to schema while still
>> creating identifiable "columns" in the database, but you'll have to predict
>> or back-impose some schema later if you're going to create indices for fast
>> searching of the data. You can perform searching of data without indexing
>> in MongoDB, but it's slower.
>>
>> Cassandra will require you to understand the schema, i.e.: what the
>> columns are up front unless you're just going to store the data without
>> schema and, therefore, without ability to search effectively.
>>
>> As suggested already, you should share more detail if you want good
>> advice. Both DBs are excellent. Both do different things in different ways.
>>
>> Hope this helps,
>> Russ
>>
>> On 05/31/2018 05:49 AM, Sudhakar Ganesan wrote:
>>
>> Team,
>>
>>
>>
>> I need to make a decision on MongoDB vs Cassandra for loading the CSV
>> file data and storing the CSV files as well. If any of you have done such a
>> study in the last couple of months, please share your analysis or observations.
>>
>>
>>
>> Regards,
>>
>> Sudhakar
>>
>>
>>
>>
>


Re: Snapshot SSTable modified??

2018-05-28 Thread Elliott Sims
Unix timestamps are a bit odd.  "mtime/Modify" is file changes,
"ctime/Change/(sometimes called create)" is file metadata changes, and a
link count change is a metadata change.  This seems like an odd decision on
the part of GNU tar, but presumably there's a good reason for it.

When the original sstable is compacted away, it's removed and therefore the
link count on the snapshot file is decremented.  The file's contents
haven't changed so mtime is identical, but ctime does get updated.  BSDtar
doesn't seem to interpret link count changes as a file change, so it's
pretty effective as a workaround.
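
For example (paths are illustrative):

# The link count (%h) and ctime (%z) change when the original sstable is
# compacted away; the mtime (%y) stays the same:
stat -c 'links=%h mtime=%y ctime=%z' \
    /var/lib/cassandra/data/ks/tbl-*/snapshots/mysnap/mb-63-big-Data.db

# bsdtar doesn't treat that as "file changed as we read it":
bsdtar -czf backup.tgz /var/lib/cassandra/data/ks/tbl-*/snapshots/mysnap/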



On Fri, May 25, 2018 at 8:00 PM, Max C  wrote:

> I looked at the source code for GNU tar, and it looks for a change in the
> create time or (more likely) a change in the size.
>
> This seems very strange to me — I would think that creating a snapshot
> would cause a flush and then once the SSTables are written, hardlinks would
> be created and the SSTables wouldn't be written to after that.
>
> Our solution is to wait 5 minutes and retry the tar if an error occurs.
> This isn't ideal - but it's the best I could come up with.  :-/
>
> Thanks Jeff & others for your responses.
>
> - Max
>
> On May 25, 2018, at 5:05pm, Elliott Sims  wrote:
>
> I've run across this problem before - it seems like GNU tar interprets
> changes in the link count as changes to the file, so if the file gets
> compacted mid-backup it freaks out even if the file contents are
> unchanged.  I worked around it by just using bsdtar instead.
>
> On Thu, May 24, 2018 at 6:08 AM, Nitan Kainth 
> wrote:
>
>> Jeff,
>>
>> Shouldn't Snapshot get consistent state of sstables? -tmp file shouldn't
>> impact backup operation right?
>>
>>
>> Regards,
>> Nitan K.
>> Cassandra and Oracle Architect/SME
>> Datastax Certified Cassandra expert
>> Oracle 10g Certified
>>
>> On Wed, May 23, 2018 at 6:26 PM, Jeff Jirsa  wrote:
>>
>>> In versions before 3.0, sstables were written with a -tmp filename and
>>> copied/moved to the final filename when complete. This changed in 3.0 - we
>>> write into the file with the final name, and have a journal/log to let us
>>> know when it's done/final/live.
>>>
>>> Therefore, you can no longer just watch for a -Data.db file to be
>>> created and uploaded - you have to watch the log to make sure it's not
>>> being written.
>>>
>>>
>>> On Wed, May 23, 2018 at 2:18 PM, Max C.  wrote:
>>>
>>>> Hi Everyone,
>>>>
>>>> We’ve noticed a few times in the last few weeks that when we’re doing
>>>> backups, tar has complained with messages like this:
>>>>
>>>> tar: /var/lib/cassandra/data/mars/test_instances_by_test_id-6a944
>>>> 0a04cc111e8878675f1041d7e1c/snapshots/backup_20180523_024502/mb-63-big-Data.db:
>>>> file changed as we read it
>>>>
>>>> Any idea what might be causing this?
>>>>
>>>> We’re running Cassandra 3.0.8 on RHEL 7.  Here’s rough pseudocode of
>>>> our backup process:
>>>>
>>>> 
>>>> SNAPSHOT_NAME=backup_YYYYMMDD_HHMMSS
>>>> nodetool snapshot -t $SNAPSHOT_NAME
>>>>
>>>> for each keyspace
>>>> - dump schema to “schema.cql"
>>>> - tar -czf /file_server/backup_$HOSTNAME_$KEYSPACE_MMDD_HHMMSS.tgz
>>>> schema.cql /var/lib/cassandra/data/$KEYSPACE/*/snapshots/$SNAPSHOT_NAME
>>>>
>>>> nodetool clearsnapshot -t $SNAPSHOT_NAME
>>>>
>>>> Thanks.
>>>>
>>>> - Max
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>>
>>>>
>>>
>>
>
>


Re: Snapshot SSTable modified??

2018-05-25 Thread Elliott Sims
I've run across this problem before - it seems like GNU tar interprets
changes in the link count as changes to the file, so if the file gets
compacted mid-backup it freaks out even if the file contents are
unchanged.  I worked around it by just using bsdtar instead.

On Thu, May 24, 2018 at 6:08 AM, Nitan Kainth  wrote:

> Jeff,
>
> Shouldn't Snapshot get consistent state of sstables? -tmp file shouldn't
> impact backup operation right?
>
>
> Regards,
> Nitan K.
> Cassandra and Oracle Architect/SME
> Datastax Certified Cassandra expert
> Oracle 10g Certified
>
> On Wed, May 23, 2018 at 6:26 PM, Jeff Jirsa  wrote:
>
>> In versions before 3.0, sstables were written with a -tmp filename and
>> copied/moved to the final filename when complete. This changed in 3.0 - we
>> write into the file with the final name, and have a journal/log to let us
>> know when it's done/final/live.
>>
>> Therefore, you can no longer just watch for a -Data.db file to be created
>> and uploaded - you have to watch the log to make sure it's not being
>> written.
>>
>>
>> On Wed, May 23, 2018 at 2:18 PM, Max C.  wrote:
>>
>>> Hi Everyone,
>>>
>>> We’ve noticed a few times in the last few weeks that when we’re doing
>>> backups, tar has complained with messages like this:
>>>
>>> tar: /var/lib/cassandra/data/mars/test_instances_by_test_id-6a944
>>> 0a04cc111e8878675f1041d7e1c/snapshots/backup_20180523_024502/mb-63-big-Data.db:
>>> file changed as we read it
>>>
>>> Any idea what might be causing this?
>>>
>>> We’re running Cassandra 3.0.8 on RHEL 7.  Here’s rough pseudocode of our
>>> backup process:
>>>
>>> 
>>> SNAPSHOT_NAME=backup_YYYYMMDD_HHMMSS
>>> nodetool snapshot -t $SNAPSHOT_NAME
>>>
>>> for each keyspace
>>> - dump schema to “schema.cql"
>>> - tar -czf /file_server/backup_$HOSTNAME_$KEYSPACE_MMDD_HHMMSS.tgz
>>> schema.cql /var/lib/cassandra/data/$KEYSPACE/*/snapshots/$SNAPSHOT_NAME
>>>
>>> nodetool clearsnapshot -t $SNAPSHOT_NAME
>>>
>>> Thanks.
>>>
>>> - Max
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>
>>
>


Re: Cassandra HEAP Suggestion.. Need a help

2018-05-24 Thread Elliott Sims
JVM GC tuning can be pretty complex, but the simplest solution to OOM is
probably switching to G1GC and feeding it a rather large heap.
Theoretically a smaller heap and carefully-tuned CMS collector is more
efficient, but CMS is kind of fragile and tuning it is more of a black art,
where you can generally get into a state of "good enough" with G1 and a
bigger heap as long as there's physically enough RAM.

If you're on 2.x I'd strongly advise updating to 3 (probably 3.11.x), as
there were some pretty significant improvements in memory allocation.  3.11
also lets you move some things off-heap.
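
Roughly what that might look like on 3.11.x (heap size and pause target are
examples only, and this assumes the default CMS lines in jvm.options are
commented out):

# Append example G1 settings to conf/jvm.options; size the heap for your node
# and leave plenty of RAM for the OS page cache.
cat >> conf/jvm.options <<'EOF'
-Xms16G
-Xmx16G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
EOF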

On Thu, May 10, 2018, 10:23 PM Jeff Jirsa  wrote:

> There's no single right answer. It depends a lot on the read/write
> patterns and other settings (onheap memtable, offheap memtable, etc).
>
> One thing that's probably always true: if you're using ParNew/CMS, a 16G
> heap is a bit large, though it may be appropriate for some read-heavy
> workloads; you'd want to make sure you start CMS earlier than default (set
> the CMS initiating occupancy lower than default). You may find it easier to
> do something like 12/3 or 12/4, and leave the remaining RAM for page cache.
>
> CASSANDRA-8150 has a bunch of notes for tuning GC configs (
> https://issues.apache.org/jira/browse/CASSANDRA-8150 ), and Amy's 2.1
> tuning guide is pretty solid too (
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html )
>
>
>
>
>
> On Fri, May 11, 2018 at 10:30 AM, Mokkapati, Bhargav (Nokia - IN/Chennai)
>  wrote:
>
>> Hi Team,
>>
>>
>>
>> I have 64GB of total system memory. 5 node cluster.
>>
>>
>>
>> x ~# free -m
>>               total        used        free      shared  buff/cache   available
>> Mem:          64266       17549       41592          66        5124       46151
>> Swap:             0           0           0
>> x ~#
>>
>>
>>
>> and “egrep -c 'processor([[:space:]]+):.*' /proc/cpuinfo” giving 12 cpu
>> cores.
>>
>>
>>
>> Currently cassandra-env.sh calculates MAX_HEAP_SIZE as ‘8GB’ and
>> HEAP_NEWSIZE as ‘1200 MB’.
>>
>>
>>
>> I am facing a Java insufficient-memory issue and the Cassandra service is
>> going down.
>>
>>
>>
>> I am going to hard-code the HEAP values in cassandra-env.sh as below.
>>
>>
>>
>> MAX_HEAP_SIZE="16G"  (1/4 of total RAM)
>>
>> HEAP_NEWSIZE="4G" (1/4 of MAX_HEAP_SIZE)
>>
>>
>>
>> Are these values correct for my setup in production? Are there any
>> disadvantages to doing this?
>>
>>
>>
>> Please let me know if any of you people faced the same issue.
>>
>>
>>
>> Thanks in advance!
>>
>>
>>
>> Best regards,
>>
>> Bhargav M
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Cassandra downgrade version

2018-04-25 Thread Elliott Sims
Looks like no major table version changes since 3.0, and a couple of minor
changes in 3.0.7/3.7 and 3.0.8/3.8:
https://github.com/apache/cassandra/blob/48a539142e9e318f9177ad8cec47819d1adc3df9/doc/source/architecture/storage_engine.rst

So, I suppose whether a revert is safe or not depends on whether the "mb"
and "mc" table format changes were backported to 3.1.0.
Looking at
https://github.com/apache/cassandra/blob/e092873728dc88aebc6ee10153b9bd3cd90cd858/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java#L112
I'd say it looks
like the "mb" and "mc" table formats were not backported to 3.1.0, so
downgrading is probably dangerous.

Potentially, if you don't need any of the features in 3.1.0 vs 3.0 you
could "downgrade" to 3.0.16 as a "safer" change that would get you
compatible with "mc" formatted sstables.  At that point you can probably go
back and forth between 3.0.16 and 3.11.2 safely.  I'm not sure this is
actually any safer than jumping directly to 3.11.2 though, and it's
definitely a lot more complicated.

This is pure speculation on my part, but given that the changes in mb and
mc aren't major row format changes, downgrading *might* still work.  Or,
potentially, you might be able to use a newer sstableloader to load the
mc-format files into 3.1.0 successfully.  If you go down this road, test
the cross-version loading thoroughly first.  It's probably not the best
plan, but if you get stuck it might be useful.
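
If you do go down that road, here's a rough sketch of sanity-checking what's
actually on disk and trial-loading across versions (hostnames and paths are
placeholders):

# List which sstable format versions exist in the data directories:
find /var/lib/cassandra/data -name '*-big-Data.db' \
    | sed 's|.*/||' | cut -d- -f1 | sort | uniq -c

# Trial-load mc-format sstables into a scratch 3.1.0 cluster before trusting it:
sstableloader -d scratch_host /path/to/keyspace/table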

All that said... 3.11.2 is probably a strict improvement over 3.1.0 in
features, stability, and performance.  I'd lean towards testing as much as
possible then just rolling forwards.