Re: Node failure Due To Very high GC pause time

2017-07-03 Thread Bryan Cheng
This is a very antagonistic use case for Cassandra :P I assume you're
familiar with Cassandra and deletes? (eg.
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html,
http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_about_deletes_c.html
)

That being said, are you giving enough time for your tables to flush to
disk? Deletes generate markers (tombstones) which can and will consume memory
until they have a chance to be flushed; after that they will impact query time
and performance (but should relieve memory pressure). If you're saturating the
capacity of your nodes, your tables will have difficulty flushing. See
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_memtable_thruput_c.html
.
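
As a rough illustration of the access pattern described in the quoted message
below (keyspace, table, and column names here are hypothetical), each delete in
that loop leaves a tombstone that sits in the memtable until it can be flushed:

CREATE TABLE IF NOT EXISTS my_ks.events (
    pk text,
    ck int,
    payload text,
    PRIMARY KEY (pk, ck)
);

-- read a slice of 500 rows starting from a given position...
SELECT ck, payload FROM my_ks.events WHERE pk = 'p1' AND ck >= 0 LIMIT 500;

-- ...then delete each row that was just read; every DELETE writes a
-- tombstone on top of the data it shadows
DELETE FROM my_ks.events WHERE pk = 'p1' AND ck = 0;
DELETE FROM my_ks.events WHERE pk = 'p1' AND ck = 1;
-- ...and so on for the rest of the slice, with the start range advancing
-- each iteration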

This could also be a heap/memory configuration issue or a GC tuning issue
(although that's unlikely if you've left those at their defaults).

--Bryan


On Mon, Jul 3, 2017 at 7:51 AM, Karthick V  wrote:

> Hi,
>
>   Recently, in my test cluster I faced outrageous GC activity which
> made the node unreachable inside the cluster itself.
>
> Scenario :
>   In a partition of 5 million rows we read the first 500 (by giving the
> starting range) and delete the same 500 again. The same is done recursively
> by changing the start range alone. Initially I didn't see any difference in
> the query performance (up to 50,000), but later I observed a significant
> degradation; at about 3.3 million the read request failed and the node went
> unreachable. After analysing my GC logs it is clear that 99% of my old-gen
> space is occupied and there is no more space for allocation, which caused
> the machine to stall.
>My doubt here is: will all of the deleted 3.3 million rows be loaded
> into my on-heap memory? If not, what are the objects occupying that
> memory?
>
> PS : I am using C* 2.1.13 in cluster.
>
>
>
>
>


Re: Does Too many GC pauses can cause cassandra service DOWN.

2017-02-14 Thread Bryan Cheng
GC can absolutely cause a server to get marked down by a peer. See
https://support.datastax.com/hc/en-us/articles/204226199-Common-Causes-of-GC-pauses

As for tuning: again, we use CMS, but this thread has some good G1 info that I
looked at while evaluating it:
https://issues.apache.org/jira/browse/CASSANDRA-7486

Also, Al's awesome tuning guide has some relevant material, although it's
2.1-centric so it might not be 100% applicable:
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

On Tue, Feb 14, 2017 at 1:57 PM, Pranay akula 
wrote:

> On a couple of servers where the Cassandra service went down, all I see is
> GC pauses, and the load on these servers was not high when the pauses
> happened. I haven't found anything else. Can that be the reason, or do I
> need to look at something else?
>
> We are using G1GC. Any reference on how to adjust G1GC settings, as we
> cannot set the new-gen size in G1?
>
>
> Thanks
>  Pranay.
>


Re: inconsistent results

2017-02-14 Thread Bryan Cheng
Change your consistency level in the cqlsh shell while you query, from ONE
to QUORUM to ALL. If you see your results change, that's a consistency
issue. (Assuming these are simple inserts; if there are deletes, collection
updates, etc. in the mix then things get a bit more complex.)
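
For example, from a cqlsh session (keyspace, table, and key are placeholders):

CONSISTENCY ONE;
SELECT * FROM my_ks.my_table WHERE id = 'some-key';
CONSISTENCY QUORUM;
SELECT * FROM my_ks.my_table WHERE id = 'some-key';
CONSISTENCY ALL;
SELECT * FROM my_ks.my_table WHERE id = 'some-key';

If the number of rows returned differs between levels, the replicas disagree
and you are looking at a consistency problem rather than an application bug.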

To diagnose why the issue exists, a helpful place to look is the various
dropped message metrics from nodetool tpstats. Overloaded clusters will
experience consistency issues as a result of dropped mutations.

It's helpful to think of things in terms of guarantees. If you write with
CL=ONE or LOCAL_ONE, you're getting exactly one guaranteed write. In a
healthy system with tons of excess capacity, you will likely see much
better consistency than that; the hint system will replicate the write to
other nodes, which will perform the write if they can. Since it appears
you're seeing inconsistency at CL=ONE, plus timeouts at CL=QUORUM, it's
quite likely your cluster is not capable of keeping up with the consistency
level you require.

Why your cluster is overloaded is another question entirely, but if you
discover that's the case, in my experience the most common causes are
excessive GC due to bad heap settings and data model issues that create
massive partitions.

On Tue, Feb 14, 2017 at 2:03 PM, Josh England  wrote:

> I suspect this is true, but it has proven to be significantly harder to
> track down.  Either cassandra is tickling some bug that nothing else does
> or something strange is going on internally.  On an otherwise quiet system,
> I'd see instant results most of the time intermixed with queries (reads)
> that would timeout and fail.  I agree this needs to be addressed but I'd
> like to understand what is currently going on with my queries.  If it is
> thought to be a consistency problem, how can that be verified?
>
> -JE
>
>
> On Tue, Feb 14, 2017 at 1:46 PM, Jonathan Haddad 
> wrote:
>
>> If you're getting a lot of timeouts you will almost certainly end up with
>> consistency issues. You're going to need to fix the root cause, your
>> cluster instability, or this sort of issue will be commonplace.
>>
>>
>> On Tue, Feb 14, 2017 at 1:43 PM Josh England  wrote:
>>
>>> I'll try it the repair.  Using quorum tends to lead to too many timeout
>>> problems though.  :(
>>>
>>> -JE
>>>
>>>
>>> On Tue, Feb 14, 2017 at 1:39 PM, Oskar Kjellin 
>>> wrote:
>>>
>>> Repair might help. But you will end up in this situation again unless
>>> you read/write using quorum (may be local)
>>>
>>> Sent from my iPhone
>>>
>>> On 14 Feb 2017, at 22:37, Josh England  wrote:
>>>
>>> All client interactions are from python (python-driver 3.7.1) using
>>> default consistency (LOCAL_ONE I think).  Should I try repairing all nodes
>>> to make sure all data is consistent?
>>>
>>> -JE
>>>
>>>
>>> On Tue, Feb 14, 2017 at 1:32 PM, Oskar Kjellin 
>>> wrote:
>>>
>>> What consistency levels are you using for reads/writes?
>>>
>>> Sent from my iPhone
>>>
>>> > On 14 Feb 2017, at 22:27, Josh England  wrote:
>>> >
>>> > I'm running Cassandra 3.9 on CentOS 6.7 in a 6-node cluster.  I've got
>>> a situation where the same query sometimes returns 2 records (correct), and
>>> sometimes only returns 1 record (incorrect).  I've ruled out the
>>> application and the indexing since this is reproducible directly from a
>>> cqlsh shell with a simple select statement.  What is the best way to debug
>>> what is happening here?
>>> >
>>> > -JE
>>> >
>>>
>>>
>>>
>>>
>


Re: Metric to monitor partition size

2017-01-13 Thread Bryan Cheng
We're on 2.X so this information may not apply to your version, but you
should see:

1) A log statement upon compaction, like "Writing large partition",
including the primary partition key (see
https://issues.apache.org/jira/browse/CASSANDRA-9643). Configurable
threshold in cassandra.yaml

2) Problematic partition distributions in nodetool cfhistograms, although
without the primary partition key

3) Potentially large partitions in sstables themselves using sstable
parsing utilities. There's also a patch for sstablekeys here, but I've
never used it (https://issues.apache.org/jira/browse/CASSANDRA-8720)

While you _could_ monitor partitions and stop writing to a partition key
when the size reaches a certain threshold (roughly obtained through a method
like the ones above), I'm struggling to think of a case where you'd actually
want to do that: pushing partitions to some maximum size is generally not a
great idea. Ideally you'd want your partitions as small as you can manage
them without making your queries absolutely neurotic.



On Thu, Jan 12, 2017 at 6:08 AM, Saumitra S 
wrote:

> Is there any metric or way to find out if any partition has grown beyond a
> certain size or certain row count?
>
> If a partition reaches a certain size or limit, I want to stop sending
> further write requests to it. Is it possible?
>
>
>


Re: Backup restore with a different name

2016-11-02 Thread Bryan Cheng
Hi Jens,

When you refer to restoring a snapshot for a developer to look at, do you
mean restoring the cluster to that state, or just exposing that state for
reference while keeping the (corrupt) current state in the live cluster?

You may find these useful:
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_backup_snapshot_restore_t.html
https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html

Additionally, AFAIK the snapshot files are just SSTables, so any utility
that can examine them (for example, sstable2json) should work on those
files as well.

On Wed, Nov 2, 2016 at 2:20 PM, Jens Rantil  wrote:

> Hi,
>
> Let's say I am periodically making snapshots of a table, say "users", for
> backup purposes. Let's say a developer makes a mistake and corrupts the
> table. Is there an easy way for me to restore a replica, say
> "users_20161102", of the original table for the developer to looks at the
> old copy?
>
> Cheers,
> Jens
>
> --
> Jens Rantil
> Backend engineer
> Tink AB
>
> Email: jens.ran...@tink.se
> Phone: +46 708 84 18 32
> Web: www.tink.se
>
>


Re: Incremental repairs in 3.0

2016-09-06 Thread Bryan Cheng
HI Jean,

This blog post is a pretty good resource:
http://www.datastax.com/dev/blog/anticompaction-in-cassandra-2-1

I believe in 2.1.x you don't need to do the manual migration procedure, but
if you run regular repairs and the data set under LCS is fairly large (what
this means will probably depend on your data model and hardware/cluster
makeup) you can take advantage of a full repair to make anticompaction a
bit easier. What we observed was the anticompaction procedure taking longer
than a standard full repair and with a higher load on the cluster while
running.

On Tue, Sep 6, 2016 at 2:00 AM, Jean Carlo <jean.jeancar...@gmail.com>
wrote:

> Hi @Bryan
>
> When you said "sizable amount of data" you meant a huge amount of data
> right? Our big table is in LCS and if we use the migration process we will
> need to run a repair seq over this table for a long time.
>
> We are planning to go to repairs inc using the version 2.1.14
>
>
> Saludos
>
> Jean Carlo
>
> "The best way to predict the future is to invent it" Alan Kay
>
> On Tue, Jun 21, 2016 at 4:34 PM, Vlad <qa23d-...@yahoo.com> wrote:
>
>> Thanks for answer!
>>
>> >It may still be a good idea to manually migrate if you have a sizable
>> amount of data
>> No, it would be brand new ;-) 3.0 cluster
>>
>>
>>
>> On Tuesday, June 21, 2016 1:21 AM, Bryan Cheng <br...@blockcypher.com>
>> wrote:
>>
>>
>> Sorry, meant to say "therefore manual migration procedure should be
>> UNnecessary"
>>
>> On Mon, Jun 20, 2016 at 3:21 PM, Bryan Cheng <br...@blockcypher.com>
>> wrote:
>>
>> I don't use 3.x so hopefully someone with operational experience can
>> chime in, however my understanding is: 1) Incremental repairs should be the
>> default in the 3.x release branch and 2) sstable repairedAt is now properly
>> set in all sstables as of 2.2.x for standard repairs and therefore manual
>> migration procedure should be necessary. It may still be a good idea to
>> manually migrate if you have a sizable amount of data and are using LCS as
>> anticompaction is rather painful.
>>
>> On Sun, Jun 19, 2016 at 6:37 AM, Vlad <qa23d-...@yahoo.com> wrote:
>>
>> Hi,
>>
>> assuming I have new, empty Cassandra cluster, how should I start using
>> incremental repairs? Is incremental repair is default now (as I don't see
>> *-inc* option in nodetool) and nothing is needed to use it, or should we
>> perform migration procedure
>> <http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesMigration.html>
>> anyway? And what happens to new column families?
>>
>> Regards.
>>
>>
>>
>>
>>
>>
>


Re: Corrupt SSTABLE over and over

2016-08-15 Thread Bryan Cheng
Hi Alaa,

Sounds like you have problems that go beyond Cassandra- likely filesystem
corruption or bad disks. I don't know enough about Windows to give you any
specific advice but I'd try a run of chkdsk to start.

--Bryan

On Fri, Aug 12, 2016 at 5:19 PM, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com>
wrote:

> Hi Bryan,
>
> Changing disk_failure_policy to best_effort, and running nodetool scrub,
> did not work, it generated another error:
> java.nio.file.AccessDeniedException
>
> Also tried to remove all files (data, commitlog, savedcaches) and restart
> the node fresh, and still I am getting corruption.
>
> and Still nothing that indicate there is a HW issue?
> All other nodes are fine
>
> Regards,
> Alaa
>
>
> On Fri, Aug 12, 2016 at 12:00 PM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> Should also add that if the scope of corruption is _very_ large, and you
>> have a good, aggressive repair policy (read: you are confident in the
>> consistency of the data elsewhere in the cluster), you may just want to
>> decommission and rebuild that node.
>>
>> On Fri, Aug 12, 2016 at 11:55 AM, Bryan Cheng <br...@blockcypher.com>
>> wrote:
>>
>>> Looks like you're doing the offline scrub- have you tried online?
>>>
>>> Here's my typical process for corrupt SSTables.
>>>
>>> With disk_failure_policy set to stop, examine the failing sstables. If
>>> they are very small (in the range of kbs), it is unlikely that there is any
>>> salvageable data there. Just delete them, start the machine, and schedule a
>>> repair ASAP.
>>>
>>> If they are large, then it may be worth salvaging. If the scope of
>>> corruption is reasonable (limited to a few sstables scattered among
>>> different keyspaces), set disk_failure_policy to best_effort, start the
>>> machine up, and run the nodetool scrub. This is online scrub, faster than
>>> offline scrub (at least of 2.1.12, the last time I had to do this).
>>>
>>> Only if all else fails, attempt the very painful offline sstablescrub.
>>>
>>> Is the VMWare client Windows? (Trying to make sure its not just the
>>> host). YMMV but in the past Windows was somewhat of a neglected platform
>>> wrt Cassandra. I think you'd have a lot easier time getting help if running
>>> Linux is an option here.
>>>
>>>
>>>
>>> On Fri, Aug 12, 2016 at 9:16 AM, Alaa Zubaidi (PDF) <
>>> alaa.zuba...@pdf.com> wrote:
>>>
>>>> Hi Jason,
>>>>
>>>> Thanks for your input...
>>>> Thats what I am afraid of?
>>>> Did you find any HW error in the VMware and HW logs? any indication
>>>> that the HW is the reason? I need to make sure that this is the reason
>>>> before asking the customer to spend more money?
>>>>
>>>> Thanks,
>>>> Alaa
>>>>
>>>> On Thu, Aug 11, 2016 at 11:02 PM, Jason Wee <peich...@gmail.com> wrote:
>>>>
>>>>> cassandra run on virtual server (vmware)?
>>>>>
>>>>> > I tried sstablescrub but it crashed with hs-err-pid-...
>>>>> maybe try with larger heap allocated to sstablescrub
>>>>>
>>>>> this sstable corrupt i ran into it as well (on cassandra 1.2), first i
>>>>> try nodetool scrub, still persist, then offline sstablescrub still
>>>>> persist, wipe the node and it happen again, then i change the hardware
>>>>> (disk and mem). things went good.
>>>>>
>>>>> hth
>>>>>
>>>>> jason
>>>>>
>>>>>
>>>>> On Fri, Aug 12, 2016 at 9:20 AM, Alaa Zubaidi (PDF)
>>>>> <alaa.zuba...@pdf.com> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I have a 16 Node cluster, Cassandra 2.2.1 on Windows, local
>>>>> installation
>>>>> > (NOT on the cloud)
>>>>> >
>>>>> > and I am getting
>>>>> > Error [CompactionExecutor:2] 2016-08-12 06:51:52, 983 Cassandra
>>>>> > Daemon.java:183 - Execption in thread Thread[CompactionExecutor:2,1m
>>>>> ain]
>>>>> > org.apache.cassandra.io.FSReaderError:
>>>>> > org.apache.cassandra.io.sstable.CorruptSSTableExecption:
>>>>> > org.apache.cassandra.io.compress.CurrptBlockException:
>>>>> > (E:\\la-4886-big-Data.db): corruption detected, chunk at
>>>>> 4969092 

Re: Corrupt SSTABLE over and over

2016-08-12 Thread Bryan Cheng
Should also add that if the scope of corruption is _very_ large, and you
have a good, aggressive repair policy (read: you are confident in the
consistency of the data elsewhere in the cluster), you may just want to
decommission and rebuild that node.

On Fri, Aug 12, 2016 at 11:55 AM, Bryan Cheng <br...@blockcypher.com> wrote:

> Looks like you're doing the offline scrub- have you tried online?
>
> Here's my typical process for corrupt SSTables.
>
> With disk_failure_policy set to stop, examine the failing sstables. If
> they are very small (in the range of kbs), it is unlikely that there is any
> salvageable data there. Just delete them, start the machine, and schedule a
> repair ASAP.
>
> If they are large, then it may be worth salvaging. If the scope of
> corruption is reasonable (limited to a few sstables scattered among
> different keyspaces), set disk_failure_policy to best_effort, start the
> machine up, and run the nodetool scrub. This is online scrub, faster than
> offline scrub (at least of 2.1.12, the last time I had to do this).
>
> Only if all else fails, attempt the very painful offline sstablescrub.
>
> Is the VMWare client Windows? (Trying to make sure its not just the host).
> YMMV but in the past Windows was somewhat of a neglected platform wrt
> Cassandra. I think you'd have a lot easier time getting help if running
> Linux is an option here.
>
>
>
> On Fri, Aug 12, 2016 at 9:16 AM, Alaa Zubaidi (PDF) <alaa.zuba...@pdf.com>
> wrote:
>
>> Hi Jason,
>>
>> Thanks for your input...
>> Thats what I am afraid of?
>> Did you find any HW error in the VMware and HW logs? any indication that
>> the HW is the reason? I need to make sure that this is the reason before
>> asking the customer to spend more money?
>>
>> Thanks,
>> Alaa
>>
>> On Thu, Aug 11, 2016 at 11:02 PM, Jason Wee <peich...@gmail.com> wrote:
>>
>>> cassandra run on virtual server (vmware)?
>>>
>>> > I tried sstablescrub but it crashed with hs-err-pid-...
>>> maybe try with larger heap allocated to sstablescrub
>>>
>>> this sstable corrupt i ran into it as well (on cassandra 1.2), first i
>>> try nodetool scrub, still persist, then offline sstablescrub still
>>> persist, wipe the node and it happen again, then i change the hardware
>>> (disk and mem). things went good.
>>>
>>> hth
>>>
>>> jason
>>>
>>>
>>> On Fri, Aug 12, 2016 at 9:20 AM, Alaa Zubaidi (PDF)
>>> <alaa.zuba...@pdf.com> wrote:
>>> > Hi,
>>> >
>>> > I have a 16 Node cluster, Cassandra 2.2.1 on Windows, local
>>> installation
>>> > (NOT on the cloud)
>>> >
>>> > and I am getting
>>> > Error [CompactionExecutor:2] 2016-08-12 06:51:52, 983 Cassandra
>>> > Daemon.java:183 - Execption in thread Thread[CompactionExecutor:2,1m
>>> ain]
>>> > org.apache.cassandra.io.FSReaderError:
>>> > org.apache.cassandra.io.sstable.CorruptSSTableExecption:
>>> > org.apache.cassandra.io.compress.CurrptBlockException:
>>> > (E:\\la-4886-big-Data.db): corruption detected, chunk at
>>> 4969092 of
>>> > length 10208.
>>> > at
>>> > org.apache.cassandra.io.util.RandomAccessReader.readBytes(Ra
>>> ndomAccessReader.java:357)
>>> > ~[apache-cassandra-2.2.1.jar:2.2.1]
>>> > 
>>> > 
>>> > ERROR [CompactionExecutor:2] ... FileUtils.java:463 - Existing
>>> > forcefully due to file system exception on startup, disk failure policy
>>> > "stop"
>>> >
>>> > I tried sstablescrub but it crashed with hs-err-pid-...
>>> > I removed the corrupted file and started the Node again, after one day
>>> the
>>> > corruption came back again, I removed the files, and restarted
>>> Cassandra, it
>>> > worked for few days, then I ran "nodetool repair" after it finished,
>>> > Cassandra failed again but with commitlog corruption, after removing
>>> the
>>> > commitlog files, it failed again with another sstable corruption.
>>> >
>>> > I was also checking the HW, file system, and memory, the VMware logs
>>> showed
>>> > no HW error, also the HW management logs showed NO problems or issues.
>>> > Also checked the Windows Logs (Application and System) the only thing I
>>> > found is on the system logs "Cassandra Service terminated with
>>> > service-specific error Cannot cr

Re: Corrupt SSTABLE over and over

2016-08-12 Thread Bryan Cheng
Looks like you're doing the offline scrub- have you tried online?

Here's my typical process for corrupt SSTables.

With disk_failure_policy set to stop, examine the failing sstables. If they
are very small (in the range of kbs), it is unlikely that there is any
salvageable data there. Just delete them, start the machine, and schedule a
repair ASAP.

If they are large, then it may be worth salvaging. If the scope of
corruption is reasonable (limited to a few sstables scattered among
different keyspaces), set disk_failure_policy to best_effort, start the
machine up, and run nodetool scrub. This is an online scrub, faster than the
offline scrub (at least as of 2.1.12, the last time I had to do this).

Only if all else fails, attempt the very painful offline sstablescrub.

Is the VMware guest Windows? (Trying to make sure it's not just the host.)
YMMV, but in the past Windows was somewhat of a neglected platform wrt
Cassandra. I think you'd have a much easier time getting help if running
Linux is an option here.



On Fri, Aug 12, 2016 at 9:16 AM, Alaa Zubaidi (PDF) 
wrote:

> Hi Jason,
>
> Thanks for your input...
> Thats what I am afraid of?
> Did you find any HW error in the VMware and HW logs? any indication that
> the HW is the reason? I need to make sure that this is the reason before
> asking the customer to spend more money?
>
> Thanks,
> Alaa
>
> On Thu, Aug 11, 2016 at 11:02 PM, Jason Wee  wrote:
>
>> cassandra run on virtual server (vmware)?
>>
>> > I tried sstablescrub but it crashed with hs-err-pid-...
>> maybe try with larger heap allocated to sstablescrub
>>
>> this sstable corrupt i ran into it as well (on cassandra 1.2), first i
>> try nodetool scrub, still persist, then offline sstablescrub still
>> persist, wipe the node and it happen again, then i change the hardware
>> (disk and mem). things went good.
>>
>> hth
>>
>> jason
>>
>>
>> On Fri, Aug 12, 2016 at 9:20 AM, Alaa Zubaidi (PDF)
>>  wrote:
>> > Hi,
>> >
>> > I have a 16 Node cluster, Cassandra 2.2.1 on Windows, local installation
>> > (NOT on the cloud)
>> >
>> > and I am getting
>> > Error [CompactionExecutor:2] 2016-08-12 06:51:52, 983 Cassandra
>> > Daemon.java:183 - Execption in thread Thread[CompactionExecutor:2,1m
>> ain]
>> > org.apache.cassandra.io.FSReaderError:
>> > org.apache.cassandra.io.sstable.CorruptSSTableExecption:
>> > org.apache.cassandra.io.compress.CurrptBlockException:
>> > (E:\\la-4886-big-Data.db): corruption detected, chunk at
>> 4969092 of
>> > length 10208.
>> > at
>> > org.apache.cassandra.io.util.RandomAccessReader.readBytes(Ra
>> ndomAccessReader.java:357)
>> > ~[apache-cassandra-2.2.1.jar:2.2.1]
>> > 
>> > 
>> > ERROR [CompactionExecutor:2] ... FileUtils.java:463 - Existing
>> > forcefully due to file system exception on startup, disk failure policy
>> > "stop"
>> >
>> > I tried sstablescrub but it crashed with hs-err-pid-...
>> > I removed the corrupted file and started the Node again, after one day
>> the
>> > corruption came back again, I removed the files, and restarted
>> Cassandra, it
>> > worked for few days, then I ran "nodetool repair" after it finished,
>> > Cassandra failed again but with commitlog corruption, after removing the
>> > commitlog files, it failed again with another sstable corruption.
>> >
>> > I was also checking the HW, file system, and memory, the VMware logs
>> showed
>> > no HW error, also the HW management logs showed NO problems or issues.
>> > Also checked the Windows Logs (Application and System) the only thing I
>> > found is on the system logs "Cassandra Service terminated with
>> > service-specific error Cannot create another system semaphore.
>> >
>> > I could not find any thing regarding that error, all comments point to
>> > application log.
>> >
>> > Any help is appreciated..
>> >
>> > --
>> >
>> > Alaa Zubaidi
>> >
>> >
>> > This message may contain confidential and privileged information. If it
>> has
>> > been sent to you in error, please reply to advise the sender of the
>> error
>> > and then immediately permanently delete it and all attachments to it
>> from
>> > your systems. If you are not the intended recipient, do not read, copy,
>> > disclose or otherwise use this message or any attachments to it. The
>> sender
>> > disclaims any liability for such unauthorized use. PLEASE NOTE that all
>> > incoming e-mails sent to PDF e-mail accounts will be archived and may be
>> > scanned by us and/or by external service providers to detect and prevent
>> > threats to our systems, investigate illegal or inappropriate behavior,
>> > and/or eliminate unsolicited promotional e-mails (“spam”). If you have
>> any
>> > concerns about this process, please contact us at
>> legal.departm...@pdf.com.
>>
>
>
>
> --
>
> Alaa Zubaidi
> PDF Solutions, Inc.
> 333 West San Carlos Street, Suite 1000
> San Jose, CA 95110  USA
> Tel: 408-283-5639
> fax: 408-938-6479
> email: alaa.zuba...@pdf.com
>
>
> 

Re: Debugging high tail read latencies (internal timeout)

2016-07-07 Thread Bryan Cheng
Hi Nimi,

My suspicions would probably lie somewhere between GC and large partitions.

The first tool would probably be a trace, but if you experience full client
timeouts from dropped messages you may find it hard to pin down the issue. You
can try running the trace with cqlsh's timeouts cranked all the way up against
the local node with CL=ONE, to try to force the local machine to answer.
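
A rough sketch of that, connected to one of the affected nodes (the query shape
is taken from your message; how you raise the client-side timeout depends on
your cqlsh version, e.g. the request timeout setting in cqlshrc):

CONSISTENCY ONE;
TRACING ON;
SELECT a, b FROM my_ks.my_table WHERE pk = 'x';
TRACING OFF;

The trace output will show which node and which step the time is going to,
assuming the query completes at all.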

What does nodetool tpstats report for dropped message counts? Are they very
high? Primarily restricted to READ, or including MUTATION, etc. ?

Are there specific PKs that trigger this behavior, either all the time or
more consistently? That would point to either very large partitions or
potentially bad hardware on a node. nodetool cfhistograms will show you
partition sizes at various percentiles as well as your max.

GC metrics should be accessible via JMX, and you should also have GCInspector
entries in cassandra/system.log that give you per-collection breakdowns.

--Bryan


On Wed, Jul 6, 2016 at 6:22 PM, Nimi Wariboko Jr 
wrote:

> Hi,
>
> I've begun experiencing very high tail latencies across my clusters. While
> Cassandra's internal metrics report <1ms read latencies, measuring
> responses from within the driver in my applications (round trips of
> query/execute frames) shows 90th-percentile round-trip times of up to a
> second for very basic queries (SELECT a,b FROM table WHERE pk=x).
>
> I've been studying the logs to try and get a handle on what could be going
> wrong. I don't think there are GC issues, but the logs mention dropped
> messages due to timeouts while the threadpools are nearly empty -
>
> https://gist.github.com/nemothekid/28b2a8e8353b3e60d7bbf390ed17987c
>
> Relevant line:
> REQUEST_RESPONSE messages were dropped in last 5000 ms: 1 for internal
> timeout and 0 for cross node timeout. Mean internal dropped latency: 54930
> ms and Mean cross-node dropped latency: 0 ms
>
> Are there any tools I can use to start to understand what is causing these
> issues?
>
> Nimi
>
>


Re: Cluster not working after upgrade from 2.1.12 to 3.5.0

2016-06-21 Thread Bryan Cheng
Hi Oskar,

I know this won't help you as quickly as you would like but please consider
updating the JIRA issue with details of your environment as it may help
move the investigation along.

Good luck!

On Tue, Jun 21, 2016 at 12:21 PM, Julien Anguenot 
wrote:

> You could try to sstabledump that one corrupted table, write some
> (Python) code to get rid of the duplicates by processing that sstabledump
> output (might not be bulletproof depending on the data, I agree),
> truncate and re-insert them back in that table without duplicates.
>
> On Tue, Jun 21, 2016 at 11:52 AM, Oskar Kjellin 
> wrote:
> > Hmm, no way we can do that in prod :/
> >
> > Sent from my iPhone
> >
> >> On 21 juni 2016, at 18:50, Julien Anguenot  wrote:
> >>
> >> See my comments on the issue: I had to truncate and reinsert data in
> >> these corrupted tables.
> >>
> >> AFAIK, there is no evidence that UDTs are responsible of this bad
> behavior.
> >>
> >>> On Tue, Jun 21, 2016 at 11:45 AM, Oskar Kjellin <
> oskar.kjel...@gmail.com> wrote:
> >>> Yea I saw that one. We're not using UDT in the affected tables tho.
> >>>
> >>> Did you resolve it?
> >>>
> >>> Sent from my iPhone
> >>>
>  On 21 juni 2016, at 18:27, Julien Anguenot 
> wrote:
> 
>  I have experienced similar duplicate primary keys behavior with couple
>  of tables after upgrading from 2.2.x to 3.0.x.
> 
>  See comments on the Jira issue I opened at the time over there:
>  https://issues.apache.org/jira/browse/CASSANDRA-11887
> 
> 
> > On Tue, Jun 21, 2016 at 10:47 AM, Oskar Kjellin <
> oskar.kjel...@gmail.com> wrote:
> > Hi,
> >
> > We've done this upgrade in both dev and stage before and we did not
> see
> > similar issues.
> > After upgrading production today we have a lot issues tho.
> >
> > The main issue is that the Datastax client quite often does not get
> > the data (even though it's the same query). I see similar flakiness by
> > simply running cqlsh; when it does return, it returns broken data.
> >
> > We are running a 3 node cluster with RF 3.
> >
> > I have this table
> >
> > CREATE TABLE keyspace.table (
> >
> > a text,
> >
> >   b text,
> >
> >   c text,
> >
> >   d list,
> >
> >   e text,
> >
> >   f timestamp,
> >
> >   g list,
> >
> >   h timestamp,
> >
> >   PRIMARY KEY (a, b, c)
> >
> > )
> >
> >
> > Every other time I query (not exactly every other time, but random)
> I get:
> >
> >
> > SELECT * from table where a = 'xxx' and b = 'xxx'
> >
> > a | b | c | d | e | f
> > | g| h
> >
> >
> -+--+---+--++-+---+-
> >
> > xxx |  xxx | ccc | null |   null | 2089-11-30
> > 23:00:00.00+ | ['fff'] | 2014-12-31 23:00:00.00+
> >
> > xxx |  xxx |   ddd |
>  null |
> > null | 2099-01-01 00:00:00.00+ | ['fff'] | 2016-06-17
> > 13:29:36.00+
> >
> >
> > Which is the expected output.
> >
> >
> > But I also get:
> >
> > a | b | c | d | e | f
> > | g| h
> >
> >
> -+--+---+--++-+---+-
> >
> > xxx |  xxx | ccc | null |   null |
> > null |  null |null
> >
> > xxx |  xxx | ccc | null |   null | 2089-11-30
> > 23:00:00.00+ | ['fff'] |null
> >
> > xxx |  xxx | ccc | null |   null |
> > null |  null | 2014-12-31 23:00:00.00+
> >
> > xxx |  xxx |   ddd |
>  null |
> > null |null |  null |
> > null
> >
> > xxx |  xxx |   ddd |
>  null |
> > null | 2099-01-01 00:00:00.00+ | ['fff'] |
> > null
> >
> > xxx |  xxx |   ddd |
>  null |
> > null |null |  null |
> 2016-06-17
> > 13:29:36.00+
> >
> >
> > Notice that the same PK is returned 3 times. With different parts of
> the
> > data. I believe this is what's currently killing our production
> environment.
> >
> >
> > I'm 

Re: Incremental repairs in 3.0

2016-06-20 Thread Bryan Cheng
Sorry, meant to say "therefore manual migration procedure should be
UNnecessary"

On Mon, Jun 20, 2016 at 3:21 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> I don't use 3.x so hopefully someone with operational experience can chime
> in, however my understanding is: 1) Incremental repairs should be the
> default in the 3.x release branch and 2) sstable repairedAt is now properly
> set in all sstables as of 2.2.x for standard repairs and therefore manual
> migration procedure should be necessary. It may still be a good idea to
> manually migrate if you have a sizable amount of data and are using LCS as
> anticompaction is rather painful.
>
> On Sun, Jun 19, 2016 at 6:37 AM, Vlad <qa23d-...@yahoo.com> wrote:
>
>> Hi,
>>
>> assuming I have new, empty Cassandra cluster, how should I start using
>> incremental repairs? Is incremental repair is default now (as I don't see
>> *-inc* option in nodetool) and nothing is needed to use it, or should we
>> perform migration procedure
>> <http://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesMigration.html>
>> anyway? And what happens to new column families?
>>
>> Regards.
>>
>
>


Re: Incremental repairs in 3.0

2016-06-20 Thread Bryan Cheng
I don't use 3.x so hopefully someone with operational experience can chime
in, however my understanding is: 1) Incremental repairs should be the
default in the 3.x release branch and 2) sstable repairedAt is now properly
set in all sstables as of 2.2.x for standard repairs and therefore manual
migration procedure should be necessary. It may still be a good idea to
manually migrate if you have a sizable amount of data and are using LCS as
anticompaction is rather painful.

On Sun, Jun 19, 2016 at 6:37 AM, Vlad  wrote:

> Hi,
>
> assuming I have new, empty Cassandra cluster, how should I start using
> incremental repairs? Is incremental repair is default now (as I don't see
> *-inc* option in nodetool) and nothing is needed to use it, or should we
> perform migration procedure
> 
> anyway? And what happens to new column families?
>
> Regards.
>


Re: OOM under high write throughputs on 2.2.5

2016-05-24 Thread Bryan Cheng
Hi Zhiyan,

Silly question but are you sure your heap settings are actually being
applied?  "697,236,904 (51.91%)" would represent a sub-2GB heap. What's the
real memory usage for Java when this crash happens?

Another thing to look into might be memtable_heap_space_in_mb, as it looks
like you're using on-heap memtables. This will be 1/4 of your heap by
default. Assuming your heap settings are actually being applied, if you run
through this space you may not have enough flushing resources;
memtable_flush_writers defaults to a somewhat low number which may not be
enough for this use case.

On Fri, May 20, 2016 at 10:02 PM, Zhiyan Shao  wrote:

> Hi, we see the following OOM crash while doing heavy write loading
> testing. Has anybody seen this kind of crash? We are using G1GC with 32GB
> heap size out of 128GB system memory. Eclipse Memory Analyzer shows the
> following:
>
> One instance of *"org.apache.cassandra.db.ColumnFamilyStore"* loaded by 
> *"sun.misc.Launcher$AppClassLoader
> @ 0x8d800898"* occupies *697,236,904 (51.91%)* bytes. The memory is
> accumulated in one instance of
> *"java.util.concurrent.ConcurrentSkipListMap$HeadIndex"* loaded by *" class loader>"*.
>
> *Keywords*
>
> java.util.concurrent.ConcurrentSkipListMap$HeadIndex
>
> sun.misc.Launcher$AppClassLoader @ 0x8d800898
>
> org.apache.cassandra.db.ColumnFamilyStore
>
> Cassandra log:
>
>
> ERROR 00:23:24 JVM state determined to be unstable.  Exiting forcefully
> due to:
> java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapByteBuffer.(HeapByteBuffer.java:57) ~[na:1.8.0_74]
> at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) ~[na:1.8.0_74]
> at
> org.apache.cassandra.utils.memory.SlabAllocator.getRegion(SlabAllocator.java:
> 137) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:
> 97) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:
> 57) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.utils.memory.ContextAllocator.clone
> (ContextAllocator.java:47) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.utils.memory.MemtableBufferAllocator.clone
> (MemtableBufferAllocator.java:61) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:212)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:
> 1249) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:406)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:366)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.db.Mutation.apply(Mutation.java:214)
> ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:
> 50) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:
> 67) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_74]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:
> 164) ~[apache-cassandra-2.2.5.jar:2.2.5]
> at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:
> 136) [apache-cassandra-2.2.5.jar:2.2.5]
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105)
> [apache-cassandra-2.2.5.jar:2.2.5]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
>
> Thanks,
> Zhiyan
>


Re: Increasing replication factor and repair doesn't seem to work

2016-05-24 Thread Bryan Cheng
Hi Luke,

I've never found nodetool status' load to be useful beyond a general
indicator.

You should expect some small skew, as this will depend on your current
compaction status, tombstones, etc. IIRC repair will not provide
consistency of intermediate states, nor will it remove tombstones; it only
guarantees consistency in the final state. This means, in the case of
dropped hints or mutations, you will see differences in intermediate
states, and therefore storage footprint, even in fully repaired nodes. This
includes intermediate UPDATE operations as well.

Your one node with sub 1GB sticks out like a sore thumb, though. Where did
you originate the nodetool repair from? Remember that repair will only
ensure consistency for ranges held by the node you're running it on. While
I am not sure if missing ranges are included in this, if you ran nodetool
repair only on a machine with partial ownership, you will need to complete
repairs across the ring before data will return to full consistency.

I would query some older data using consistency = ONE on the affected
machine to determine if you are actually missing data.  There are a few
outstanding bugs in the 2.1.x  and older release families that may result
in tombstone creation even without deletes, for example CASSANDRA-10547,
which impacts updates on collections in pre-2.1.13 Cassandra.
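
As a concrete spot check (keyspace, table, and key are placeholders), connect
cqlsh to the node that is only showing ~943 MB and query a partition that was
written before the replication change:

CONSISTENCY ONE;
SELECT * FROM my_ks.my_table WHERE pk = 'some-old-key';

If older rows are missing at ONE but come back at QUORUM or ALL, that's a
strong hint the historical data never made it to that replica and the repair
did not do what you expected.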

You can also try examining the output of nodetool ring, which will give you
a breakdown of tokens and their associations within your cluster.

--Bryan

On Tue, May 24, 2016 at 3:49 PM, kurt Greaves  wrote:

> Not necessarily considering RF is 2 so both nodes should have all
> partitions. Luke, are you sure the repair is succeeding? You don't have
> other keyspaces/duplicate data/extra data in your cassandra data directory?
> Also, you could try querying on the node with less data to confirm if it
> has the same dataset.
>
> On 24 May 2016 at 22:03, Bhuvan Rawal  wrote:
>
>> For the other DC, it can be acceptable because partition reside on one
>> node, so say  if you have a large partition, it may skew things a bit.
>> On May 25, 2016 2:41 AM, "Luke Jolly"  wrote:
>>
>>> So I guess the problem may have been with the initial addition of the
>>> 10.128.0.20 node because when I added it in it never synced data I
>>> guess?  It was at around 50 MB when it first came up and transitioned to
>>> "UN". After it was in I did the 1->2 replication change and tried repair
>>> but it didn't fix it.  From what I can tell all the data on it is stuff
>>> that has been written since it came up.  We never delete data ever so we
>>> should have zero tombstones.
>>>
>>> If I am not mistaken, only two of my nodes actually have all the data,
>>> 10.128.0.3 and 10.142.0.14 since they agree on the data amount. 10.142.0.13
>>> is almost a GB lower and then of course 10.128.0.20 which is missing
>>> over 5 GB of data.  I tried running nodetool -local on both DCs and it
>>> didn't fix either one.
>>>
>>> Am I running into a bug of some kind?
>>>
>>> On Tue, May 24, 2016 at 4:06 PM Bhuvan Rawal 
>>> wrote:
>>>
 Hi Luke,

 You mentioned that replication factor was increased from 1 to 2. In
 that case was the node bearing ip 10.128.0.20 carried around 3GB data
 earlier?

 You can run nodetool repair with option -local to initiate repair local
 datacenter for gce-us-central1.

 Also you may suspect that if a lot of data was deleted while the node
 was down it may be having a lot of tombstones which is not needed to be
 replicated to the other node. In order to verify the same, you can issue a
 select count(*) query on column families (With the amount of data you have
 it should not be an issue) with tracing on and with consistency local_all
 by connecting to either 10.128.0.3  or 10.128.0.20 and store it in a
 file. It will give you a fair amount of idea about how many deleted cells
 the nodes have. I tried searching for reference if tombstones are moved
 around during repair, but I didnt find evidence of it. However I see no
 reason to because if the node didnt have data then streaming tombstones
 does not make a lot of sense.

 Regards,
 Bhuvan

 On Tue, May 24, 2016 at 11:06 PM, Luke Jolly 
 wrote:

> Here's my setup:
>
> Datacenter: gce-us-central1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load   Tokens   Owns (effective)  Host ID
>   Rack
> UN  10.128.0.3   6.4 GB 256  100.0%
>  3317a3de-9113-48e2-9a85-bbf756d7a4a6  default
> UN  10.128.0.20  943.08 MB  256  100.0%
>  958348cb-8205-4630-8b96-0951bf33f3d3  default
> Datacenter: gce-us-east1
> 
> Status=Up/Down
> |/ 

Re: Limit 1

2016-04-21 Thread Bryan Cheng
As far as I know, the answer is yes; however, it is unlikely that the cursor
will have to probe very far to find a valid row unless your data is highly
bursty. The key cache (assuming you have it enabled) will allow the query
to skip unrelated rows in its search.

However I would caution against TTL'ing the world and generating a 1-to-1
ratio of writes to deletes.

One approach you can try is to compound your primary key with the hour.
Then the latest hour of events can be retrieved by PK lookup. If you delete
older rows that are outside of the partition you're operating on, your
cursor will not have to skip tombstones to find valid results.
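
A minimal sketch of that approach, reusing the table from your message with a
hypothetical hour_bucket column added to the partition key (set by the client
to 'created' truncated to the hour):

CREATE TABLE mytable_by_hour (
    object_id text,
    hour_bucket timestamp,
    created timeuuid,
    my_data text,
    PRIMARY KEY ((object_id, hour_bucket), created)
) WITH CLUSTERING ORDER BY (created DESC);

-- latest event is a single-partition lookup that never crosses hour boundaries
SELECT my_data FROM mytable_by_hour
WHERE object_id = 'xxx' AND hour_bucket = '2016-04-20 18:00:00+0000'
LIMIT 1;

Right after the hour rolls over you may need to fall back to the previous
bucket if the current one is still empty.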

On Wed, Apr 20, 2016 at 11:37 AM, Jimmy Lin  wrote:

> I have the following table (using default size-tiered compaction) whose
> columns get TTLed every hour (as we want to keep only the last 1 hour of
> events).
>
> And I do
> Select * from mytable where object_id = ‘’ LIMIT 1;
>
> And since the query is only interested in the last/latest value, will
> Cassandra need to scan multiple sstables or potentially skip over tombstoned
> data just to get the latest data?
>
> Or is it smart enough to know the beginning of the sstables and get the
> result very efficiently?
>
>
> CREATE TABLE mytable (
> object_id text,
> created timeuuid,
> my_data text,
> PRIMARY KEY (object_id, created)
> ) WITH CLUSTERING ORDER BY (created DESC)
>


Re: Cassandra Golang Driver and Support

2016-04-13 Thread Bryan Cheng
Hi Yawei,

While you're right that there's no first-party driver, we've had good luck
using gocql (https://github.com/gocql/gocql) in production at moderate
scale. What features in particular are you looking for that are missing?

--Bryan

On Tue, Apr 12, 2016 at 10:06 PM, Yawei Li  wrote:

> Hi,
>
> It looks to me like DataStax doesn't provide an official Golang driver
> yet, and the Golang client libs are overall lagging behind the Java driver
> in terms of feature set, supported versions and possibly production
> stability?
>
> We are going to support a large number of services  in both Java and Go.
> if the above impression is largely true, we are considering the option of
> focusing on Java client and having GoLang program talk to the Java service
> via RPC for data access. Anyone has tried similar approach?
>
> Thanks
>


Re: Unable to connect to CQLSH or Launch SparkContext

2016-04-11 Thread Bryan Cheng
Check your environment variables; it looks like JAVA_HOME is not properly set.

On Mon, Apr 11, 2016 at 9:07 AM, Lokesh Ceeba - Vendor <
lokesh.ce...@walmart.com> wrote:

> Hi Team,
>
>   Help required
>
>
>
> cassandra:/app/cassandra $ nodetool status
>
>
>
> Cassandra 2.0 and later require Java 7u25 or later.
>
> cassandra:/app/cassandra $ nodetool status
>
>
>
> Cassandra 2.0 and later require Java 7u25 or later.
>
> cassandra:/app/cassandra $ java -version
>
> Error occurred during initialization of VM
>
> java.lang.OutOfMemoryError: unable to create new native thread
>
>
>
>
>
>
>
> --
>
> Lokesh
> This email and any files transmitted with it are confidential and intended
> solely for the individual or entity to whom they are addressed. If you have
> received this email in error destroy it immediately. *** Walmart
> Confidential ***
>


Re: Large primary keys

2016-04-11 Thread Bryan Cheng
While large primary keys (within reason) should work, IMO anytime you're
doing equality testing you are really better off minimizing the size of the
key. Huge primary keys will also have very negative impacts on your key
cache. I would err on the side of the digest, but I've never had a need for
large keys so perhaps someone who has used them before would have a
different perspective.
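
A sketch of the digest-keyed variant (names are hypothetical; the digest,
e.g. a SHA-256 of the text, is computed in the application before the write):

CREATE TABLE documents (
    doc_digest blob PRIMARY KEY,   -- e.g. SHA-256 of doc_text
    doc_text text
);

-- point lookups use the small digest instead of the full text
SELECT doc_text FROM documents WHERE doc_digest = 0xcafebabe;  -- placeholder value

The background full-table scan still gets the text, since it's stored alongside
the digest, and an exact-text lookup just hashes the text client-side first.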

On Mon, Apr 11, 2016 at 2:43 PM, Robert Wille  wrote:

> I have a need to be able to use the text of a document as the primary key
> in a table. These texts are usually less than 1K, but can sometimes be 10’s
> of K’s in size. Would it be better to use a digest of the text as the key?
> I have a background process that will occasionally need to do a full table
> scan and retrieve all of the texts, so using the digest doesn’t eliminate
> the need to store the text. Anyway, is it better to keep primary keys
> small, or is C* okay with large primary keys?
>
> Robert
>
>


Re: Cassandra sstable to Mysql

2016-04-02 Thread Bryan Cheng
You have SSTables and you want to get importable data?

You could use a tool like sstable2json to get JSON-formatted data directly
from the sstables; however, unless they've been perfectly compacted, there
will be duplicates and interleaved updates that you will need to reconcile.

If this is a full dump from a single machine that has a complete dataset
(eg. with RF=n) you could spin up a new machine with just itself as a seed
but all other configuration intact. If this new machine gets an identical
copy of the cassandra data directory, it will start itself as a clone of
the machine the dump came off of, but walled off from the previous cluster.
(I have tested this with vnodes, but I believe it is more involved without
vnodes). Then you can use CQL COPY or an application to bulk load into
MySQL.
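
If you go the cloned-node route, the CQL COPY step could look roughly like this
from cqlsh on the clone (keyspace, table, column names, and path are
placeholders):

COPY my_ks.my_table (id, col_a, col_b)
  TO '/tmp/my_table.csv' WITH HEADER = TRUE;

The resulting CSV can then be bulk loaded into MySQL (e.g. with LOAD DATA
INFILE) or fed through whatever validation code you have.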

On Fri, Apr 1, 2016 at 2:55 AM, Abhishek Aggarwal <
abhishek.aggarwa...@snapdeal.com> wrote:

> Hi ,
>
> We have a data dump in a directory, taken from MySQL using
> CQLSSTableWriter.
>
> Our requirement is to read this data and load it into MySQL. We don't want
> to use Cassandra, as it will lead to read traffic and this operation is just
> for some validation.
>
> Can anyone help us with the solution.
>
> Abhishek Aggarwal
>
> *Senior Software Engineer*
> *M*: +91 8861212073 , 8588840304
> *T*: 0124 6600600 *EXT*: 12128
> ASF Center -A, ASF Center Udyog Vihar Phase IV,
>


Re: cassandra disks cache on SSD

2016-04-02 Thread Bryan Cheng
Hi Vincent, have you already tried the more common tuning operations like
row cache?

I haven't done any disk level caching like this (we use SSD's exclusively),
but you may see some benefit from putting your commitlog on a separate
conventional HDD if you haven't tried this already. This may push your
read/write pattern to do more sequential access.

On Fri, Apr 1, 2016 at 12:48 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Can you provide me a approximate estimation of performance gain ?
>
> 2016-04-01 19:27 GMT+02:00 Mateusz Korniak :
>
>> On Friday 01 April 2016 13:16:53 vincent gromakowski wrote:
>> > (...)  looking
>> > for a way to use some kind of tiering with few SSD caching hot data from
>> > HDD.
>> > I have identified two solutions (...)
>>
>> We are using lvmcache for that.
>> Regards,
>> --
>> Mateusz Korniak
>> "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
>> krótko mówiąc - podpora społeczeństwa."
>> Nikos Kazantzakis - "Grek Zorba"
>>
>>
>


Re: Multi DC setup for analytics

2016-03-31 Thread Bryan Cheng
I'm jumping into this thread late, so sorry if this has been covered
before. But am I correct in reading that you have two different Cassandra
rings, not talking to each other at all, and you want to have a shared DC
with a third Cassandra ring?

I'm not sure what you want to do is possible.

If I had the luxury of starting from scratch, the design I would do is:
All three DC's in one cluster, with 3 datacenters. DC3 is the analytics DC.
DC1's keyspaces are replicated to DC1 and DC3 only.
DC2's keyspaces are replicated to DC2 and DC3 only.

Then you have DC3 with all data from both DC1 and DC2 to run analytics on,
and no cross-talk between DC1 and DC2.
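
In CQL terms that layout is just per-keyspace replication settings, roughly
like the following (keyspace names and replication factors are placeholders
for whatever you actually run in DC1 and DC2):

ALTER KEYSPACE dc1_app_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC3': 3};

ALTER KEYSPACE dc2_app_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC2': 3, 'DC3': 3};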

If you cannot rebuild your existing clusters, you may want to consider
using something like Spark to ETL your data out of DC1 and DC2 into a new
cluster at DC3. At that point you're running a data warehouse and lose some
of the advantages of seamless cluster membership.

On Wed, Mar 30, 2016 at 5:43 AM, Anishek Agarwal  wrote:

> Hey Guys,
>
> We did the necessary changes and were trying to get this back on track,
> but hit another wall,
>
> we have two Clusters in Different DC ( DC1 and DC2) with cluster names (
> CLUSTER_1, CLUSTER_2)
>
> we want to have a common analytics cluster in DC3 with cluster name
> (CLUSTER_3). -- looks like this can't be done, so we have to setup two
> different analytics cluster ? can't we just get data from CLUSTER_1/2 to
> same cluster CLUSTER_3 ?
>
> thanks
> anishek
>
> On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal 
> wrote:
>
>> Hey Clint,
>>
>> we have two separate rings which don't talk to each other but both having
>> the same DC name "DCX".
>>
>> @Raja,
>>
>> We had already gone towards the path you suggested.
>>
>> thanks all
>> anishek
>>
>> On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja  wrote:
>>
>>> Yes. Here are the steps.
>>> You will have to change the DC Names first.
>>> DC1 and DC2 would be independent clusters.
>>>
>>> Create a new DC, DC3 and include these two DC's on DC3.
>>>
>>> This should work well.
>>>
>>>
>>> On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin <
>>> clintlmar...@coolfiretechnologies.com> wrote:
>>>
 When you say you have two logical DC both with the same name are you
 saying that you have two clusters of servers both with the same DC name,
 nether of which currently talk to each other? IE they are two separate
 rings?

 Or do you mean that you have two keyspaces in one cluster?

 Or?

 Clint
 On Mar 14, 2016 2:11 AM, "Anishek Agarwal"  wrote:

> Hello,
>
> We are using cassandra 2.0.17 and have two logical DC having different
> Keyspaces but both having same logical name DC1.
>
> we want to setup another cassandra cluster for analytics which should
> get data from both the above DC.
>
> if we setup the new DC with name DC2 and follow the steps
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html
> will it work ?
>
> I would think we would have to first change the names of existing
> clusters to have to different names and then go with adding another dc
> getting data from these?
>
> Also as soon as we add the node the data starts moving... this will
> all be only real time changes done to the cluster right ? we still have to
> do the rebuild to get the data for tokens for node in new cluster ?
>
> Thanks
> Anishek
>

>>>
>>>
>>> --
>>> "In this world, you either have an excuse or a story. I preferred to
>>> have a story"
>>>
>>
>>
>


Re: Cassandra Upgrade 3.0.x vs 3.x (Tick-Tock Release)

2016-03-14 Thread Bryan Cheng
Hi Kathir,

The specific version will depend on your needs (e.g. libraries) and your
risk/stability profile. Personally, I generally go with the oldest branch
still under active maintenance (which would be 2.2.x, or 2.1.x if you only
need critical fixes), but there's lots of good stuff in 3.x if you're happy
being a little closer to the bleeding edge.

There was a bit of discussion elsewhere on this list, eg here:
https://www.mail-archive.com/user@cassandra.apache.org/msg45990.html,
searching may turn up some more recommendations.

--Bryan

On Mon, Mar 14, 2016 at 12:40 PM, Kathiresan S  wrote:

> Hi,
>
> We are planning for Cassandra upgrade in our production environment.
> Which version of Cassandra is stable and is advised to upgrade to, at the
> moment?
>
> Looking at this JIRA (CASSANDRA-10822
> ), it looks like,
> if at all we plan to upgrade any recent version, it should be >= 3.0.2/3.2
>
> Should it be 3.0.4 / 3.0.3 / 3.3 or 3.4 ? In general, is it a good
> practice to upgrade to a Tick-Tock release instead of 3.0.X version. Please
> advice.
>
> Thanks,
> ​​Kathir
>


Re: Unexplainably large reported partition sizes

2016-03-07 Thread Bryan Cheng
Hi Tom,

Do you use any collections on this column family? We've had issues in the
past with unexpectedly large partitions reported on data models with
collections, which can also generate tons of tombstones on UPDATE (
https://issues.apache.org/jira/browse/CASSANDRA-10547)
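
For reference, the pattern that tends to bite here is ordinary-looking
collection overwrites, roughly like this (schema and names are made up):

CREATE TABLE IF NOT EXISTS my_ks.profiles (
    user_id uuid PRIMARY KEY,
    tags set<text>
);

-- assigning a whole new set deletes the old contents (a range tombstone)
-- and then inserts the new elements, even though no DELETE was issued
UPDATE my_ks.profiles
  SET tags = {'a', 'b'}
  WHERE user_id = 9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d;

Appending instead (SET tags = tags + {'c'}) avoids the overwrite tombstone.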

--Bryan


On Mon, Mar 7, 2016 at 11:23 AM, Robert Coli  wrote:

> On Sat, Mar 5, 2016 at 9:16 AM, Tom van den Berge 
> wrote:
>
>> I don't think compression can be the cause of the difference, because of
>> two reasons:
>>
>
> Your two reasons seem legitimate.
>
> Though you say you do not frequently do DELETE and so it shouldn't be due
> to tombstones, there are semi-recent versions of Cassandra which create a
> runaway avalanche of tombstones that double every time they are compacted.
> What version are you running?
>
> Also, is there some reason you are not just dumping the table with
> sstable2json and inspecting the contents of the row in question?
>
> =Rob
>
>
>
>


Re: Modeling transactional messages

2016-03-04 Thread Bryan Cheng
I think most people will tell you what Sean did: queues are considered an
anti-pattern for many reasons in Cassandra, and while it's possible, you
may want to consider something more suited to the job (RabbitMQ and Redis
queues are just a few of the ideas that come to mind).

If you're sold on the idea of using Cassandra for this, you will likely
need a few tables, as Sean points out. Also, you should try to avoid the
situation where you have a 1:1 ratio of writes to deletes (everything
written will be deleted, and quickly)- this will exercise a great many
limitations in Cassandra's design. For queues, especially if you 1) want to
act quickly on the contents of the queue, 2) cannot miss a message, and 3)
do not want duplicate actions, you're going to have trouble with a
distributed system like Cassandra.

One common approach is to persist the actual data (message contents) in
Cassandra keyed by eg. msgid timeuuid, or (userid, msgid) , and to use a
durable queue like RabbitMQ to contain the uuids for the actual queue
behavior, circumventing the need for deletes and simplifying the
failure/retry logic. Then you get historical lookups (give me all emails
I've sent to userid) as well.
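
A sketch of the Cassandra side of that split, reusing column names from your
table where possible (the queue itself would only carry msgid values):

CREATE TABLE communication.transactional_email_by_recipient (
    toaddr text,
    msgid timeuuid,
    type text,
    subject text,
    content text,
    sentdate timestamp,
    PRIMARY KEY (toaddr, msgid)
) WITH CLUSTERING ORDER BY (msgid DESC);

-- "all messages sent to this recipient", newest first
SELECT msgid, type, subject
FROM communication.transactional_email_by_recipient
WHERE toaddr = 'customer@example.com' LIMIT 50;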

On Fri, Mar 4, 2016 at 1:36 PM, I PVP  wrote:

> Thanks for  answering.
>
> Yes, It is mainly a queue, but also has some functionality to allow resend
> the messages.
>
> Does anyone have experience handling this kind of scenario, within (or
> without) Cassandra?
>
> Thanks
>
> --
> IPVP
>
>
> From: sean_r_dur...@homedepot.com 
> 
> Reply: user@cassandra.apache.org >
> 
> Date: March 4, 2016 at 11:48:56 AM
> To: user@cassandra.apache.org >
> 
> Subject:  RE: Modeling transactional messages
>
> As you have it, this is not a good model for Cassandra. Your partition key
> has only 2 specific values. You would end up with only 2 partitions
> (perhaps owned by just 2 nodes) that would quickly get huge (and slow).
> Also, secondary indexes are generally a bad idea. You would either want to
> create new table to support additional queries or look at the materialized
> views in the 3.x versions.
>
>
>
> You are setting up something like a queue, which is typically an
> anti-pattern for Cassandra.
>
>
>
> However, I will at least toss out an idea for the rest of the community to
> improve (or utterly reject):
>
>
>
> You could have an unsent mail table and a sent mail table.
>
> For unsent mail, just use the objectID as the partition key. The drivers
> can page through results, though if it gets very large, you might see
> problems. Delete the row from unsent mail once it is sent. Try leveled
> compaction with a short gc_grace. There would be a lot of churn on this
> table, so it may still be less than ideal.
>
>
>
> Then you could do the sent email table with objectID and all the email
> details. Add separate lookup tables for:
>
> - (emailaddr), object ID (if this is going to be large/wide, perhaps add a
> time bucket to the partition key, like mm)
>
> - (domain, time bucket), objectID
>
>
>
> Set TTL on these rows (either default or with the insert) to get the purge
> to be automatic.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* I PVP [mailto:i...@hotmail.com]
> *Sent:* Thursday, March 03, 2016 7:51 PM
> *To:* user@cassandra.apache.org
> *Subject:* Modeling transactional messages
>
>
>
> Hi everyone,
>
>
>
> Can anyone please let me know if I am heading toward an anti-pattern or
> something else bad?
>
>
>
> How would you model the following ... ?
>
>
>
> I am migrating from MySQL to Cassandra. I have a scenario in which I need to
> store the content of "to be sent" transactional email messages that the
> customer will receive on events like: an order was created, an order was
> updated, an order was canceled, an order was shipped, an account was
> created, an account was confirmed, an account was locked, and so on.
>
>
>
> In MySQL there is a table per email message "type", like: a table to store
> messages of "order-created", a table to store messages of "order-updated",
> and so on.
>
>
>
> The messages are sent by a non-parallelized Java worker, scheduled to run
> every X seconds, that pushes the messages to a service like
> Sendgrid/Mandrill/Mailjet.
>
>
>
> For better performance, easier purging, and overall code maintenance, I am
> looking to have all message "types" in a single table/column family, as
> follows:
>
>
>
> CREATE TABLE communication.transactional_email (
>
> objectid timeuuid,
>
> subject text,
>
> content text,
>
> fromname text,
>
> fromaddr text,
>
> toname text,
>
> toaddr text,
>
> wassent boolean,
>
> createdate timestamp,
>
> sentdate timestamp,
>
> type text, // example: order_created, order_canceled
>
> domain text, // example: hotmail.com, in case we need to stop sending to a
> specific domain
>
> PRIMARY KEY (wassent, objectid)
>
> );
>
>
>
> 

Re: Lot of GC on two nodes out of 7

2016-03-03 Thread Bryan Cheng
Hi Anishek,

In addition to the good advice others have given, do you notice any
abnormally large partitions? What does cfhistograms report for 99%
partition size? A few huge partitions will cause very disproportionate load
on your cluster, including high GC.
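
For example, something along these lines (substitute your keyspace and table
names) will show the distribution, including the max partition size in bytes:

nodetool cfhistograms <keyspace> <table>
nodetool cfstats <keyspace>.<table>

cfstats also reports the maximum compacted partition/row size, which is a
quick way to spot a single outlier partition.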

--Bryan

On Wed, Mar 2, 2016 at 9:28 AM, Amit Singh F 
wrote:

> Hi Anishek,
>
>
>
> We too faced a similar problem in 2.0.14, and after doing some research we
> configured a few parameters in cassandra.yaml and were able to overcome GC
> pauses. Those are:
>
>
>
> · memtable_flush_writers: increased from 1 to 3, since the tpstats
> output showed dropped mutations, which means writes are getting blocked;
> increasing this number takes care of those.
>
> · memtable_total_space_in_mb: default is 1/4 of the heap size; it can be
> lowered because large, long-lived objects will create pressure on the heap,
> so it is better to reduce its size somewhat.
>
> · concurrent_compactors: Alain rightly pointed this out, i.e.
> reduce it to 8. You need to try this.
>
>
>
> Also please check whether you have mutation drops on other nodes or not.
>
>
>
> Hope this helps in your cluster too.
>
>
>
> Regards
>
> Amit Singh
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Wednesday, March 02, 2016 9:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Lot of GC on two nodes out of 7
>
>
>
> Can you post a gist of the output of jstat -gccause (60 seconds worth)?  I
> think it's cool you're willing to experiment with alternative JVM settings
> but I've never seen anyone use max tenuring threshold of 50 either and I
> can't imagine it's helpful.  Keep in mind if your objects are actually
> reaching that threshold it means they've been copied 50x (really really
> slow) and also you're going to end up spilling your eden objects directly
> into your old gen if your survivor is full.  Considering the small amount
> of memory you're using for heap I'm really not surprised you're running
> into problems.
>
>
>
> I recommend G1GC + 12GB heap and just let it optimize itself for almost
> all cases with the latest JVM versions.
>
>
>
> On Wed, Mar 2, 2016 at 6:08 AM Alain RODRIGUEZ  wrote:
>
> It looks like you are doing good work with this cluster and know a lot
> about the JVM, that's good :-).
>
>
>
> our machine configurations are : 2 X 800 GB SSD , 48 cores, 64 GB RAM
>
>
>
> That's good hardware too.
>
>
>
> With 64 GB of ram I would probably directly give a try to
> `MAX_HEAP_SIZE=8G` on one of the 2 bad nodes probably.
>
>
>
> Also I would also probably try lowering `HEAP_NEWSIZE=2G.` and using
> `-XX:MaxTenuringThreshold=15`, still on the canary node to observe the
> effects. But that's just an idea of something I would try to see the
> impacts, I don't think it will solve your current issues or even make it
> worse for this node.
>
>
>
> Using G1GC would allow you to use a bigger Heap size. Using C*2.1 would
> allow you to store the memtables off-heap. Those are 2 improvements
> reducing the heap pressure that you might be interested in.
>
>
>
> I have spent time reading about all other options before including them
> and a similar configuration on our other prod cluster is showing good GC
> graphs via gcviewer.
>
>
>
> So, let's look for an other reason.
>
>
>
> there are MUTATION and READ messages dropped in high numbers on the nodes in
> question, and on the other 5 nodes it varies between 1-3.
>
>
>
> - Is Memory, CPU or disk a bottleneck? Is one of those running at the
> limits?
>
>
>
> concurrent_compactors: 48
>
>
>
> Reducing this to 8 would free some space for transactions (R requests).
> It is probably worth a try, even more when compaction is not keeping up and
> compaction throughput is not throttled.
>
>
>
> Just found an issue about that:
> https://issues.apache.org/jira/browse/CASSANDRA-7139
>
>
>
> Looks like `concurrent_compactors: 8` is the new default.
>
>
>
> C*heers,
>
> ---
>
> Alain Rodriguez - al...@thelastpickle.com
>
> France
>
>
>
> The Last Pickle - Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
>
>
>
>
>
>
>
>
>
> 2016-03-02 12:27 GMT+01:00 Anishek Agarwal :
>
> Thanks a lot Alain for the details.
>
> `HEAP_NEWSIZE=4G` is probably far too high (try 1200M <-> 2G).
> `MAX_HEAP_SIZE=6G` might be too low; how much memory is available? (You
> might want to keep this as is, or even reduce it if you have less than 16 GB
> of native memory. Go with 8 GB if you have a lot of memory.)
> `-XX:MaxTenuringThreshold=50` is the highest value I have seen in use so
> far. I had luck with values between 4 <--> 16 in the past. I would give it a
> try with 15.
> `-XX:CMSInitiatingOccupancyFraction=70` --> Why not use the default of 75?
> Using the default and then tuning from there to improve things is generally a
> good idea.
>
>
>
>
>
> we have a lot of reads and writes onto the system so keeping the high new
> size to make sure enough is held in memory 

Re: Checking replication status

2016-03-01 Thread Bryan Cheng
HI Jeremy,

For more insight into the hint system, these two blog posts are great
resources: http://www.datastax.com/dev/blog/modern-hinted-handoff, and
http://www.datastax.com/dev/blog/whats-coming-to-cassandra-in-3-0-improved-hint-storage-and-delivery
.

For timeframes, that's going to differ based on your read/write patterns
and load. Although I haven't tried this before, I believe you can
query the system.hints
table to see the status of hints queued by the local machine.
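
Something like this (untested, so treat it as a sketch) should show whether
hints are still queued on a node:

cqlsh> SELECT target_id, dateOf(hint_id) FROM system.hints LIMIT 20;
cqlsh> SELECT count(*) FROM system.hints;

An empty result (or a count of 0) means the node has no pending hints left to
deliver.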

--local and --dc are similar in the sense that they are always repairs
against the local datacenter; they just differ in syntax. If you sustain a
loss of inter-DC connectivity for longer than max_hint_window_in_ms, you'll
want to run a cross-DC repair, which is just the standard full repair
(without specifying either).
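
Roughly (exact flags vary a bit by version):

# repair only against replicas in the local datacenter
nodetool repair -local
# standard repair across all datacenters (what you want after an outage
# longer than the hint window); add -pr when running it on every node
nodetool repair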

On Mon, Feb 29, 2016 at 7:38 PM, Jimmy Lin <y2klyf+w...@gmail.com> wrote:

> hi Bryan,
> I guess I want to find out if there is any way to tell when data will
> become consistent again in both cases.
>
> if the node has been down for shorter than the max_hint_window (say 2 hours
> out of a 3 hr max), is there any way to check the log or JMX etc. to see if
> the hint queue size is back to zero or a lower range?
>
>
> if a node goes down for longer than the max_hint_window time (say 4 hrs > our
> max 3 hrs), we run a repair job. What is the correct nodetool repair job
> syntax to use?
> in particular, what is the difference between -local vs -dc? they both
> seem to indicate repairing nodes within a datacenter, but for a cross-DC
> network outage, we want to repair nodes across DCs, right?
>
> thanks
>
>
>
> On Fri, Feb 26, 2016 at 3:38 PM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> Hi Jimmy,
>>
>> If you sustain a long downtime, repair is almost always the way to go.
>>
>> It seems like you're asking to what extent a cluster is able to
>> recover/resync a downed peer.
>>
>> A peer will not attempt to reacquire all the data it has missed while
>> being down. Recovery happens in a few ways:
>>
>> 1) Hints: Assuming that there are enough peers to satisfy your quorum
>> requirements on write, the live peers will queue up these operations for up
>> to max_hint_window_in_ms (from cassandra.yaml). These hints will be
>> delivered once the peer recovers.
>> 2) Read repair: There is a probability that read repair will happen,
>> meaning that a query will trigger data consistency checks and updates _on
>> the query being performed_.
>> 3) Repair.
>>
>> If a machine goes down for longer than max_hint_window_in_ms, AFAIK you
>> _will_ have missing data. If you cannot tolerate this situation, you need
>> to take a look at your tunable consistency and/or trigger a repair.
>>
>> On Thu, Feb 25, 2016 at 7:26 PM, Jimmy Lin <y2klyf+w...@gmail.com> wrote:
>>
>>> so far they are not long, just some config change and restart.
>>> if it is a 2 hr downtime due to whatever reason, is a repair a better
>>> option than trying to figure out if the replication sync finished or not?
>>>
>>> On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>>
>>>> Hmm. What are your processes when a node comes back after "a long
>>>> offline"? Long enough to take the node offline and do a repair? Run the
>>>> risk of serving stale data? Parallel repairs? ???
>>>>
>>>> So, what sort of time frames are "a long time"?
>>>>
>>>>
>>>> *...*
>>>>
>>>>
>>>>
>>>> *Daemeon C.M. Reiydelle*
>>>> *USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*
>>>>
>>>> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin <y2k...@gmail.com> wrote:
>>>>
>>>>> hi all,
>>>>>
>>>>> what are the better ways to check the overall replication status of a
>>>>> cassandra cluster?
>>>>>
>>>>>  within a single DC, unless a node is down for a long time, most of the
>>>>> time i feel it is pretty much a non-issue and things are replicated pretty
>>>>> fast. But when a node comes back from a long offline period, is there a way
>>>>> to check that the node has finished its data sync with the other nodes?
>>>>>
>>>>>  Now across DCs, we have frequent VPN outages (sometimes short, sometimes
>>>>> long) between DCs. i would also like to know if there is a way to find how
>>>>> the replication progress between DCs is catching up under this condition?
>>>>>
>>>>>  Also, if i understand correctly, the only guaranteed way to make sure
>>>>> data are synced is to run a complete repair job,
>>>>> is that correct? I am trying to see if there is a way to "force a quick
>>>>> replication sync" between DCs after a vpn outage.
>>>>> Or maybe this is unnecessary, as Cassandra will catch up as fast as it
>>>>> can, and there is nothing else we/(system admin) can do to make it faster
>>>>> or better?
>>>>>
>>>>>
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Checking replication status

2016-02-26 Thread Bryan Cheng
Hi Jimmy,

If you sustain a long downtime, repair is almost always the way to go.

It seems like you're asking to what extent a cluster is able to
recover/resync a downed peer.

A peer will not attempt to reacquire all the data it has missed while being
down. Recovery happens in a few ways:

1) Hints: Assuming that there are enough peers to satisfy your quorum
requirements on write, the live peers will queue up these operations for up
to max_hint_window_in_ms (from cassandra.yaml). These hints will be
delivered once the peer recovers.
2) Read repair: There is a probability that read repair will happen,
meaning that a query will trigger data consistency checks and updates _on
the query being performed_.
3) Repair.

If a machine goes down for longer than max_hint_window_in_ms, AFAIK you
_will_ have missing data. If you cannot tolerate this situation, you need
to take a look at your tunable consistency and/or trigger a repair.

On Thu, Feb 25, 2016 at 7:26 PM, Jimmy Lin  wrote:

> so far they are not long, just some config change and restart.
> if it is a 2 hr downtime due to whatever reason, is a repair a better
> option than trying to figure out if the replication sync finished or not?
>
> On Thu, Feb 25, 2016 at 1:09 PM, daemeon reiydelle 
> wrote:
>
>> Hmm. What are your processes when a node comes back after "a long
>> offline"? Long enough to take the node offline and do a repair? Run the
>> risk of serving stale data? Parallel repairs? ???
>>
>> So, what sort of time frames are "a long time"?
>>
>>
>> *...*
>>
>>
>>
>> *Daemeon C.M. Reiydelle*
>> *USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*
>>
>> On Thu, Feb 25, 2016 at 11:36 AM, Jimmy Lin  wrote:
>>
>>> hi all,
>>>
>>> what are the better ways to check the overall replication status of a
>>> cassandra cluster?
>>>
>>>  within a single DC, unless a node is down for a long time, most of the time
>>> i feel it is pretty much a non-issue and things are replicated pretty fast.
>>> But when a node comes back from a long offline period, is there a way to
>>> check that the node has finished its data sync with the other nodes?
>>>
>>>  Now across DCs, we have frequent VPN outages (sometimes short, sometimes
>>> long) between DCs. i would also like to know if there is a way to find how
>>> the replication progress between DCs is catching up under this condition?
>>>
>>>  Also, if i understand correctly, the only guaranteed way to make sure data
>>> are synced is to run a complete repair job,
>>> is that correct? I am trying to see if there is a way to "force a quick
>>> replication sync" between DCs after a vpn outage.
>>> Or maybe this is unnecessary, as Cassandra will catch up as fast as it can,
>>> and there is nothing else we/(system admin) can do to make it faster or
>>> better?
>>>
>>>
>>>
>>> Sent from my iPhone
>>>
>>
>>
>


Re: Cassandra Multi DC (Active-Active) Setup - Measuring latency & throughput performance

2016-02-26 Thread Bryan Cheng
Hi Chandra,

For write latency, etc., the tools are still largely the same set of tools
you'd use for a single DC: tracing, cfhistograms, and cassandra-stress
come to mind. The exact results are going to differ based on your
consistency tuning (can you get away with LOCAL_QUORUM vs QUORUM?) and
read/write patterns.
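
For example (keyspace/table names and the consistency level below are just
placeholders):

cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> TRACING ON;
cqlsh> INSERT INTO myks.mytable (id, val) VALUES (uuid(), 'x');
-- the trace that follows shows which replicas (local and remote) were
-- involved and how long each hop took

# synthetic write load, run from each DC, to compare latencies
cassandra-stress write n=100000 cl=LOCAL_QUORUM -node <a-node-in-that-dc>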

What other data are you looking to gather?

On Fri, Feb 26, 2016 at 5:53 AM,  wrote:

> Hi,
>
>
> Are there any links/resources which describe performance measurement
> (latency & throughput) for a Cassandra Multi DC Active-Active setup across
> a WAN network (20Gbps bandwidth) with 5 nodes in each DC.
>
>
> Basically, I would like to know how to measure latency of writes when data
> is replicated across DC (local/remote) in active-active cluster setup
>
>
> Regards, Chandra KR
>


Re: "Not enough replicas available for query" after reboot

2016-02-04 Thread Bryan Cheng
Hey Flavien!

Did your reboot come with any other changes (schema, configuration,
topology, version)?

On Thu, Feb 4, 2016 at 2:06 PM, Flavien Charlon 
wrote:

> I'm using the C# driver 2.5.2. I did try to restart the client
> application, but that didn't make any difference, I still get the same
> error after restart.
>
> On 4 February 2016 at 21:54,  wrote:
>
>> What client are you using?
>>
>>
>>
>> It is possible that the client saw nodes down and has kept them marked
>> that way (without retrying). Depending on the client, you may have options
>> to set in RetryPolicy, FailoverPolicy, etc. A bounce of the client will
>> probably fix the problem for now.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Flavien Charlon [mailto:flavien.char...@gmail.com]
>> *Sent:* Thursday, February 04, 2016 4:06 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: "Not enough replicas available for query" after reboot
>>
>>
>>
>> Yes, all three nodes see all three nodes as UN.
>>
>>
>>
>> Also, connecting from a local Cassandra machine using cqlsh, I can run
>> the same query just fine (with QUORUM consistency level).
>>
>>
>>
>> On 4 February 2016 at 21:02, Robert Coli  wrote:
>>
>> On Thu, Feb 4, 2016 at 12:53 PM, Flavien Charlon <
>> flavien.char...@gmail.com> wrote:
>>
>> My cluster was running fine. I rebooted all three nodes (one by one), and
>> now all nodes are back up and running. "nodetool status" shows UP for all
>> three nodes on all three nodes:
>>
>>
>>
>> --  AddressLoad   Tokens  OwnsHost ID
>>   Rack
>>
>> UN  xx.xx.xx.xx331.84 GB  1   ?
>> d3d3a79b-9ca5-43f9-88c4-c3c7f08ca538  RAC1
>>
>> UN  xx.xx.xx.xx317.2 GB   1   ?
>> de7917ed-0de9-434d-be88-bc91eb4f8713  RAC1
>>
>> UN  xx.xx.xx.xx  291.61 GB  1   ?
>> b489c970-68db-44a7-90c6-be734b41475f  RAC1
>>
>>
>>
>> However, now the client application fails to run queries on the cluster
>> with:
>>
>>
>>
>> Cassandra.UnavailableException: Not enough replicas available for query
>> at consistency Quorum (2 required but only 1 alive)
>>
>>
>>
>> Do *all* nodes see each other as UP/UN?
>>
>>
>>
>> =Rob
>>
>>
>>
>>
>>
>>
>
>


Re: EC2 storage options for C*

2016-02-03 Thread Bryan Cheng
ke any difference?
>>>>>>>
>>>>>>> What info is available on EBS performance at peak times, when
>>>>>>> multiple AWS customers have spikes of demand?
>>>>>>>
>>>>>>> Is RAID much of a factor or help at all using EBS?
>>>>>>>
>>>>>>> How exactly is EBS provisioned in terms of its own HA - I mean, with
>>>>>>> a properly configured Cassandra cluster RF provides HA, so what is the
>>>>>>> equivalent for EBS? If I have RF=3, what assurance is there that those
>>>>>>> three EBS volumes aren't all in the same physical rack?
>>>>>>>
>>>>>>> For multi-data center operation, what configuration options assure
>>>>>>> that the EBS volumes for each DC are truly physically separated?
>>>>>>>
>>>>>>> In terms of syncing data for the commit log, if the OS call to sync
>>>>>>> an EBS volume returns, is the commit log data absolutely 100% synced at 
>>>>>>> the
>>>>>>> hardware level on the EBS end, such that a power failure of the systems 
>>>>>>> on
>>>>>>> which the EBS volumes reside will still guarantee availability of the
>>>>>>> fsynced data. As well, is return from fsync an absolute guarantee of
>>>>>>> sstable durability when Cassandra is about to delete the commit log,
>>>>>>> including when the two are on different volumes? In practice, we would 
>>>>>>> like
>>>>>>> some significant degree of pipelining of data, such as during the full
>>>>>>> processing of flushing memtables, but for the fsync at the end a solid
>>>>>>> guarantee is needed.
>>>>>>>
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Mon, Feb 1, 2016 at 12:56 AM, Eric Plowe <eric.pl...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Jeff,
>>>>>>>>
>>>>>>>> If EBS goes down, then EBS Gp2 will go down as well, no? I'm not
>>>>>>>> discounting EBS, but prior outages are worrisome.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sunday, January 31, 2016, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Free to choose what you'd like, but EBS outages were also
>>>>>>>>> addressed in that video (second half, discussion by Dennis Opacki). 
>>>>>>>>> 2016
>>>>>>>>> EBS isn't the same as 2011 EBS.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jeff Jirsa
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Jan 31, 2016, at 8:27 PM, Eric Plowe <eric.pl...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Thank you all for the suggestions. I'm torn between GP2 vs
>>>>>>>>> Ephemeral. GP2 after testing is a viable contender for our workload. 
>>>>>>>>> The
>>>>>>>>> only worry I have is EBS outages, which have happened.
>>>>>>>>>
>>>>>>>>> On Sunday, January 31, 2016, Jeff Jirsa <
>>>>>>>>> jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>
>>>>>>>>>> Also in that video - it's long but worth watching
>>>>>>>>>>
>>>>>>>>>> We tested up to 1M reads/second as well, blowing out page cache
>>>>>>>>>> to ensure we weren't "just" reading from memory
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Jirsa
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Jan 31, 2016, at 9:52 AM, Jack Krupansky <
>>>>>>>>>> jack.krupan...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> How about reads? Any differences between read-intensive and
>>>>>>>>>> write-intensive workloads?
>>>>>>>>>>
>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>
>>>>>>>>>> On Sun, Jan 31, 2016 at 3:13 AM, Jeff Jirsa <
>>>>>>>>>> jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi John,
>>>>>>>>>>>
>>>>>>>>>>> We run using 4T GP2 volumes, which guarantee 10k iops. Even at
>>>>>>>>>>> 1M writes per second on 60 nodes, we didn’t come close to hitting 
>>>>>>>>>>> even 50%
>>>>>>>>>>> utilization (10k is more than enough for most workloads). PIOPS is 
>>>>>>>>>>> not
>>>>>>>>>>> necessary.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: John Wong
>>>>>>>>>>> Reply-To: "user@cassandra.apache.org"
>>>>>>>>>>> Date: Saturday, January 30, 2016 at 3:07 PM
>>>>>>>>>>> To: "user@cassandra.apache.org"
>>>>>>>>>>> Subject: Re: EC2 storage options for C*
>>>>>>>>>>>
>>>>>>>>>>> For production I'd stick with ephemeral disks (aka instance
>>>>>>>>>>> storage) if you have running a lot of transaction.
>>>>>>>>>>> However, for regular small testing/qa cluster, or something you
>>>>>>>>>>> know you want to reload often, EBS is definitely good enough and we 
>>>>>>>>>>> haven't
>>>>>>>>>>> had issues 99%. The 1% is kind of anomaly where we have flush 
>>>>>>>>>>> blocked.
>>>>>>>>>>>
>>>>>>>>>>> But Jeff, kudo that you are able to use EBS. I didn't go through
>>>>>>>>>>> the video, do you actually use PIOPS or just standard GP2 in your
>>>>>>>>>>> production cluster?
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jan 30, 2016 at 1:28 PM, Bryan Cheng <
>>>>>>>>>>> br...@blockcypher.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yep, that motivated my question "Do you have any idea what
>>>>>>>>>>>> kind of disk performance you need?". If you need the performance, 
>>>>>>>>>>>> its hard
>>>>>>>>>>>> to beat ephemeral SSD in RAID 0 on EC2, and its a solid, battle 
>>>>>>>>>>>> tested
>>>>>>>>>>>> configuration. If you don't, though, EBS GP2 will save a _lot_ of 
>>>>>>>>>>>> headache.
>>>>>>>>>>>>
>>>>>>>>>>>> Personally, on small clusters like ours (12 nodes), we've found
>>>>>>>>>>>> our choice of instance dictated much more by the balance of price, 
>>>>>>>>>>>> CPU, and
>>>>>>>>>>>> memory. We're using GP2 SSD and we find that for our patterns the 
>>>>>>>>>>>> disk is
>>>>>>>>>>>> rarely the bottleneck. YMMV, of course.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 29, 2016 at 7:32 PM, Jeff Jirsa <
>>>>>>>>>>>> jeff.ji...@crowdstrike.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you have to ask that question, I strongly recommend m4 or
>>>>>>>>>>>>> c4 instances with GP2 EBS.  When you don’t care about replacing a 
>>>>>>>>>>>>> node
>>>>>>>>>>>>> because of an instance failure, go with i2+ephemerals. Until 
>>>>>>>>>>>>> then, GP2 EBS
>>>>>>>>>>>>> is capable of amazing things, and greatly simplifies life.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We gave a talk on this topic at both Cassandra Summit and AWS
>>>>>>>>>>>>> re:Invent: https://www.youtube.com/watch?v=1R-mgOcOSd4 It’s
>>>>>>>>>>>>> very much a viable option, despite any old documents online that 
>>>>>>>>>>>>> say
>>>>>>>>>>>>> otherwise.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> From: Eric Plowe
>>>>>>>>>>>>> Reply-To: "user@cassandra.apache.org"
>>>>>>>>>>>>> Date: Friday, January 29, 2016 at 4:33 PM
>>>>>>>>>>>>> To: "user@cassandra.apache.org"
>>>>>>>>>>>>> Subject: EC2 storage options for C*
>>>>>>>>>>>>>
>>>>>>>>>>>>> My company is planning on rolling out a C* cluster in EC2. We
>>>>>>>>>>>>> are thinking about going with ephemeral SSDs. The question is 
>>>>>>>>>>>>> this: Should
>>>>>>>>>>>>> we put two in RAID 0 or just go with one? We currently run a 
>>>>>>>>>>>>> cluster in our
>>>>>>>>>>>>> data center with 2 250gig Samsung 850 EVO's in RAID 0 and we are 
>>>>>>>>>>>>> happy with
>>>>>>>>>>>>> the performance we are seeing thus far.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Eric
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Steve Robenalt
>>>>> Software Architect
>>>>> sroben...@highwire.org <bza...@highwire.org>
>>>>> (office/cell): 916-505-1785
>>>>>
>>>>> HighWire Press, Inc.
>>>>> 425 Broadway St, Redwood City, CA 94063
>>>>> www.highwire.org
>>>>>
>>>>> Technology for Scholarly Communication
>>>>>
>>>>
>>>>
>>> --
>> Ben Bromhead
>> CTO | Instaclustr
>> +1 650 284 9692
>>
>
>


Re: Any tips on how to track down why Cassandra won't cluster?

2016-02-03 Thread Bryan Cheng
> On Wed, 3 Feb 2016 at 11:49 Richard L. Burton III 
> wrote:
>
>>
>> Any suggestions on how to track down what might trigger this problem? I'm
>> not receiving any exceptions.
>>
>
You're not getting "Unable to gossip with any seeds" on the second node?
What does nodetool status show on both machines?


Re: EC2 storage options for C*

2016-01-30 Thread Bryan Cheng
Yep, that motivated my question "Do you have any idea what kind of disk
performance you need?". If you need the performance, it's hard to beat
ephemeral SSD in RAID 0 on EC2, and it's a solid, battle-tested
configuration. If you don't, though, EBS GP2 will save a _lot_ of headache.

Personally, on small clusters like ours (12 nodes), we've found our choice
of instance dictated much more by the balance of price, CPU, and memory.
We're using GP2 SSD and we find that for our patterns the disk is rarely
the bottleneck. YMMV, of course.

On Fri, Jan 29, 2016 at 7:32 PM, Jeff Jirsa 
wrote:

> If you have to ask that question, I strongly recommend m4 or c4 instances
> with GP2 EBS.  When you don’t care about replacing a node because of an
> instance failure, go with i2+ephemerals. Until then, GP2 EBS is capable of
> amazing things, and greatly simplifies life.
>
> We gave a talk on this topic at both Cassandra Summit and AWS re:Invent:
> https://www.youtube.com/watch?v=1R-mgOcOSd4 It’s very much a viable
> option, despite any old documents online that say otherwise.
>
>
>
> From: Eric Plowe
> Reply-To: "user@cassandra.apache.org"
> Date: Friday, January 29, 2016 at 4:33 PM
> To: "user@cassandra.apache.org"
> Subject: EC2 storage options for C*
>
> My company is planning on rolling out a C* cluster in EC2. We are thinking
> about going with ephemeral SSDs. The question is this: Should we put two in
> RAID 0 or just go with one? We currently run a cluster in our data center
> with 2 250gig Samsung 850 EVO's in RAID 0 and we are happy with the
> performance we are seeing thus far.
>
> Thanks!
>
> Eric
>


Re: Session timeout

2016-01-29 Thread Bryan Cheng
To throw my (unsolicited) 2 cents into the ring, Oleg, you work for a
well-funded and fairly large company. You are certainly free to continue
using the list and asking for community support (I am definitely not in any
position to tell you otherwise, anyway), but that community support is by
definition ad-hoc and best-effort. Furthermore, your questions range from
trivial to, as Jonathan has mentioned earlier, concepts that many of us have
no reason to consider at this time (perhaps your work will convince us
otherwise - but you'll need to finish it first ;) )

What I'm getting at here is that perhaps, if you need faster, deeper level,
and more elaborate support than this list can provide, you should look into
the services of a paid Cassandra support company like Datastax.

On Fri, Jan 29, 2016 at 3:34 PM, Robert Coli  wrote:

> On Fri, Jan 29, 2016 at 3:12 PM, Jack Krupansky 
> wrote:
>
>> One last time, I'll simply renew my objection to the way you are abusing
>> this list.
>>
>
> FWIW, while I appreciate that OP (Oleg) is attempting to do a service for
> the community, I agree that the flood of single topic, context-lacking
> posts regarding deep internals of Cassandra is likely to inspire the
> opposite of a helpful response.
>
> This is important work, however, so hopefully we can collectively find a
> way through the meta and can discuss this topic without acrimony! :D
>
> =Rob
>
>


Re: EC2 storage options for C*

2016-01-29 Thread Bryan Cheng
Do you have any idea what kind of disk performance you need?

Cassandra with RAID 0 is a fairly common configuration (Al's awesome tuning
guide has a blurb on it
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html), so if
you feel comfortable with the operational overhead it seems like a solid
choice.
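
If you do go with two ephemeral disks in RAID 0, the setup itself is small; a
sketch, assuming the devices show up as /dev/xvdb and /dev/xvdc (device names
vary by instance type):

# stripe the two ephemeral disks and mount them as the Cassandra data dir
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdb /dev/xvdc
mkfs.ext4 /dev/md0
mount -o noatime /dev/md0 /var/lib/cassandra

Remember ephemeral data disappears when the instance stops, so you're relying
on replication (and backups) rather than the disk itself.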

To clarify, though, by "just one", do you mean using just one of the two
ephemeral disks available to the instance, or are you evaluating
different instance types based on one disk vs two?

On Fri, Jan 29, 2016 at 4:33 PM, Eric Plowe  wrote:

> My company is planning on rolling out a C* cluster in EC2. We are thinking
> about going with ephemeral SSDs. The question is this: Should we put two in
> RAID 0 or just go with one? We currently run a cluster in our data center
> with 2 250gig Samsung 850 EVO's in RAID 0 and we are happy with the
> performance we are seeing thus far.
>
> Thanks!
>
> Eric
>


Help debugging a very slow query

2016-01-13 Thread Bryan Cheng
Hi list,

Would appreciate some insight into some irregular performance we're seeing.

We have a column family that has become problematic recently. We've noticed
a few queries take enormous amounts of time, and seem to clog up read
resources on the machine (read pending tasks pile up, then immediately are
relieved).

I've included the output of cfhistograms on this keyspace[1]. The latencies
sampled do not include one of these problematic partitions, but show two
things: 1) the vast majority of queries to this table seem to be healthy,
and 2) that the maximum partition size is absurd (4139110981 bytes).

This particular cf is not expected to be updated beyond an initial set of
writes, but can be read many times. The data model includes several hashes
that amount to a few KB at most, a set that can hit ~30-40 entries,
and three lists that reach a hundred or so entries each at most. There
doesn't appear to be any material difference in the size or character of
the data saved between "good" and "bad" partitions. Often, the same
extremely slow partition queried with consistency ONE returns cleanly and
very quickly against other replicas.

I've included a trace of one of these slow returns[2], which I find very
strange: The vast majority of operations are very quick, but the final step
is extremely slow. Nothing exceeds 2ms until the final "Read 1 live and 0
tombstone cells" which takes a whopping 69 seconds [!!]. We've checked our
garbage collection in this time period and have not noticed any significant
collections.

As far as I can tell, the trace doesn't raise any red flags, and we're
largely stumped.

We've got two main questions:

1) What's up with the megapartition? What's the best way to debug this? Our
data model is largely write once, we don't do any updates. We do DELETE,
but the partitions that are giving us issues haven't been removed. We had
some suspicions on https://issues.apache.org/jira/browse/CASSANDRA-10547,
but that seems to largely be triggered by UPDATE operations.

2) What could cause the Read to take such an absurd amount of time when
it's a pair of sstables and the memtable being examined, and its just a
single cell being read? We originally suspected just memory pressure from
huge sstables, but without a corresponding GC this seems unlikely?

Any ideas?

Thanks in advance!

--Bryan


[1]
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                           (micros)      (micros)         (bytes)
50%             1.00          35.00         72.00            1109          14
75%             1.00          50.00        149.00            1331          17
95%             1.00          72.00        924.00            4768          35
98%             2.00         103.00       1597.00            9887          72
99%             2.00         149.00       1597.00           14237         103
Min             0.00          15.00         25.00              43           0
Max             2.00         258.00       6866.00      4139110981       20501

[2]

[
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f524890-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Parsing select * from pooltx where hash =
0x5f805c68d66e7d271361e7774a7eeec0591eb5197d4f420126cea83171f0a8ff;",
"Source": "172.31.54.46",
"SourceElapsed": 26
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f526fa0-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Preparing statement",
"Source": "172.31.54.46",
"SourceElapsed": 79
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f52bdc0-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Executing single-partition query on pooltx",
"Source": "172.31.54.46",
"SourceElapsed": 1014
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f52e4d0-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Acquiring sstable references",
"Source": "172.31.54.46",
"SourceElapsed": 1016
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f530be0-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Merging memtable tombstones",
"Source": "172.31.54.46",
"SourceElapsed": 1029
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f5332f0-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Bloom filter allows skipping sstable 387133",
"Source": "172.31.54.46",
"SourceElapsed": 1040
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f535a00-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Key cache hit for sstable 386331",
"Source": "172.31.54.46",
"SourceElapsed": 1046
  },
  {
"Sessionid": "4f51fa70-ba2f-11e5-8729-e1d125cb9b2d",
"Eventid": "4f538110-ba2f-11e5-8729-e1d125cb9b2d",
"Activity": "Seeking to 

Re: max connection per user

2016-01-13 Thread Bryan Cheng
Are you actively exposing your database to users outside of your
organization, or are you just asking about security best practices?

If you mean the former, this isn't really a common use case and there isn't
a huge amount out of the box that Cassandra will do to help.

If you're just asking about security best-practices,
http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf
has a brief blurb, and there are many resources online for securing
Cassandra specifically and databases in general- the approaches are going
to be largely the same.

Can you describe what avenues you're expecting either intrusion or DOS?

On Wed, Jan 13, 2016 at 6:01 PM, oleg yusim  wrote:

> OK Rob, I see what you're saying. Well, let's dive into the long questions
> and answers in this case a bit:
>
> 1) Is there any other approach Cassandra currently utilizes to mitigate
> DoS attacks?
> 2) How about max connections per DB? I know Cassandra has this parameter
> in the JDBC driver configuration, but what would be the suggested value not
> to exceed?
>
> Thanks,
>
> Oleg
>
> On Wed, Jan 13, 2016 at 6:31 PM, Robert Coli  wrote:
>
>> On Wed, Jan 13, 2016 at 1:41 PM, oleg yusim  wrote:
>>
>>> Quick question, here: does Cassandra have a configuration switch to
>>> limit number of connections per user (protection of DoS attack, security)?
>>>
>>
>> Quick answer : no.
>>
>> =Rob
>>
>>
>
>


Re: Rebuilding a new Cassandra node at 100Mb/s

2015-12-03 Thread Bryan Cheng
Jonathan: Have you changed stream_throughput_outbound_megabits_per_sec in
cassandra.yaml?

# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Cassandra does
# mostly sequential IO when streaming data during bootstrap or repair, which
# can lead to saturating the network connection and degrading rpc
performance.
# When unset, the default is 200 Mbps or 25 MB/s.
# stream_throughput_outbound_megabits_per_sec: 200
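
You can also bump it on a live node without a restart (the value is in
megabits per second), for example:

nodetool setstreamthroughput 400    # raise the cap to roughly 50 MB/s
nodetool getstreamthroughput        # confirm what's currently in effect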


On Thu, Dec 3, 2015 at 11:32 AM, Robert Coli  wrote:

> On Thu, Dec 3, 2015 at 7:51 AM, Jonathan Ballet 
> wrote:
>
>> I noticed it's not really fast and my monitoring system shows that the
>> traffic incoming on this node is exactly at 100Mb/s (12.6MB/s). I know it
>> can be much more than that (I just tested sending a file through SSH
>> between the two machines and it goes up to 1Gb/s), is there a limitation of
>> some sort on Cassandra which limit the transfer rate to 100Mb/s?
>>
>
> Probably limited by number of simultaneous parallel streams. Many people
> do not want streams to go "as fast as possible" because their priority is
> maintaining baseline service times while rebuilding/bootstrapping.
>
> Not sure there's a way to tune it, but this is definitely on the "large
> node" radar..
>
> =Rob
>
>


Re: Transitioning to incremental repair

2015-12-02 Thread Bryan Cheng
Ah Marcus, that looks very promising- unfortunately we have already
switched back to full repairs and our test cluster has been re-purposed for
other tasks atm. I will be sure to apply the patch/try a fixed version of
Cassandra if we attempt to migrate to incremental repair again.


Re: Issues on upgrading from 2.2.3 to 3.0

2015-12-02 Thread Bryan Cheng
Has your configuration changed?

This is a new check- https://issues.apache.org/jira/browse/CASSANDRA-10242.
It seems likely either your snitch changed, your properties changed, or
something caused Cassandra to think one of the two happened...
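
For example, with GossipingPropertyFileSnitch the values being compared come
from conf/cassandra-rackdc.properties on each node, and they have to match
what the node previously announced:

# conf/cassandra-rackdc.properties
dc=DC1
rack=RAC1

(With PropertyFileSnitch it's cassandra-topology.properties instead.)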

What's your node layout?

On Fri, Nov 27, 2015 at 6:45 PM, Carlos A  wrote:

> Hello all,
>
> I had 2 of my systems upgraded to 3.0 from the same previous version.
>
> The first cluster seem to be fine.
>
> But the second, each node starts and then fails.
>
> On the log I have the following on all of them:
>
> INFO  [main] 2015-11-27 19:40:21,168 ColumnFamilyStore.java:381 -
> Initializing system_schema.keyspaces
> INFO  [main] 2015-11-27 19:40:21,177 ColumnFamilyStore.java:381 -
> Initializing system_schema.tables
> INFO  [main] 2015-11-27 19:40:21,185 ColumnFamilyStore.java:381 -
> Initializing system_schema.columns
> INFO  [main] 2015-11-27 19:40:21,192 ColumnFamilyStore.java:381 -
> Initializing system_schema.triggers
> INFO  [main] 2015-11-27 19:40:21,198 ColumnFamilyStore.java:381 -
> Initializing system_schema.dropped_columns
> INFO  [main] 2015-11-27 19:40:21,203 ColumnFamilyStore.java:381 -
> Initializing system_schema.views
> INFO  [main] 2015-11-27 19:40:21,208 ColumnFamilyStore.java:381 -
> Initializing system_schema.types
> INFO  [main] 2015-11-27 19:40:21,215 ColumnFamilyStore.java:381 -
> Initializing system_schema.functions
> INFO  [main] 2015-11-27 19:40:21,220 ColumnFamilyStore.java:381 -
> Initializing system_schema.aggregates
> INFO  [main] 2015-11-27 19:40:21,225 ColumnFamilyStore.java:381 -
> Initializing system_schema.indexes
> ERROR [main] 2015-11-27 19:40:21,831 CassandraDaemon.java:250 - Cannot
> start node if snitch's rack differs from previous rack. Please fix the
> snitch or decommission and rebootstrap this node.
>
> It asks to "Please fix the snitch or decommission and rebootstrap this
> node"
>
> If none of the nodes can go up, how can I decommission all of them?
>
> Doesn't make sense.
>
> Any suggestions?
>
> Thanks,
>
> C.
>


Re: Transitioning to incremental repair

2015-12-01 Thread Bryan Cheng
Sorry if I misunderstood, but are you asking about the LCS case?

Based on our experience, I would absolutely recommend you continue with the
migration procedure. Even if the compaction strategy is the same, the
process of anticompaction is incredibly painful. We observed our test
cluster running 2.1.11 experiencing a dramatic increase in latency and not
responding to nodetool queries over JMX while anticompacting the largest
SSTables. This procedure also took several times longer than a standard
full repair.
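
For reference, the manual migration step (marking the existing sstables as
repaired before the first incremental run) looks roughly like this, run
against flushed sstables with the node stopped; double-check the flags
against the docs for your version:

sstablerepairedset --really-set --is-repaired /path/to/keyspace/table/*-Data.db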

If you absolutely cannot perform the migration procedure, I believe 2.2.x
contains the changes to automatically set the RepairedAt flags after a full
repair, so you may be able to do a full repair on 2.2.x and then transition
directly to incremental without migrating (can someone confirm?)


Generalized download link?

2015-11-16 Thread Bryan Cheng
Hey list,

Is there a URL available for downloading Cassandra that abstracts away the
mirror selection (eg. just 302's to a mirror URL?) We've got a few
self-configuring Cassandras (for example, the Docker container our devs
use), and using the same mirror for the containers or for any bulk
provisioning operation seems like bad table manners.


Re: Repair Hangs while requesting Merkle Trees

2015-11-16 Thread Bryan Cheng
Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely
want that set. Even the best network connections will break occasionally,
and in Cassandra < 2.1.10 (I believe) this would leave those connections
hanging indefinitely on one end.
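
In cassandra.yaml that's a one-liner; something like this (24 hours shown
purely as an illustration, defaults differ between versions):

# give up on stalled streams instead of hanging forever
streaming_socket_timeout_in_ms: 86400000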

How far away are your two DC's from a network perspective, out of
curiosity? You'll almost certainly be doing different TCP stack tuning for
cross-DC, notably your buffer sizes, window params, cassandra-specific
stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.
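
As a starting point only (the numbers are placeholders you'd tune against
your actual RTT and bandwidth, not recommendations):

# /etc/sysctl.d/cassandra-wan.conf -- bigger TCP buffers for high-latency links
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

# cassandra.yaml
inter_dc_tcp_nodelay: false
# (otc_coalescing_strategy only exists on newer 2.1+ versions)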

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra 
wrote:

> One more observation. We observed that there are a few TCP connections which
> the node shows as Established, but when we go to the node at the other end,
> the connection is not there. They are called "phantom" connections, I guess.
> Can this be a possible cause?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
> --
> *From*:"Anuj Wadehra" 
> *Date*:Sat, 14 Nov, 2015 at 11:59 pm
>
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
> Thanks Daemeon !!
>
> I will capture the output of netstats and share it in the next few days. We
> were thinking of taking tcp dumps also. If it's a network issue and increasing
> the request timeout worked, I am not sure how Cassandra is dropping messages
> based on timeout. Repair messages are non-droppable and not supposed to be
> timed out.
>
> 2 of the 3 nodes in the DC are able to complete repair without any issue.
> Just one node is problematic.
>
> I also observed frequent messages in the logs of other nodes which say that
> hint replay timed out... and the node where hints were being replayed is
> always a remote DC node. Is it related somehow?
>
> Thanks
> Anuj
>
> Sent from Yahoo Mail on Android
> 
> --
> *From*:"daemeon reiydelle" 
> *Date*:Thu, 12 Nov, 2015 at 10:34 am
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
>
> Have you checked the network statistics on that machine (netstat -tas)
> while attempting to repair? ... if netstat shows ANY issues you have a
> problem. Can you put the command in a loop running every 60 seconds for
> maybe 15 minutes and post back?
>
> Out of curiousity, how many remote DC nodes are getting successfully
> repaired?
>
>
>
> *...*
>
>
>
>
>
>
> *“Life should not be a journey to the grave with the intention of arriving
> safely in a pretty and well preserved body, but rather to skid in broadside
> in a cloud of smoke, thoroughly used up, totally worn out, and loudly
> proclaiming “Wow! What a Ride!”” - Hunter Thompson*
> *Daemeon C.M. Reiydelle*
> *USA (+1) 415.501.0198 | London (+44) (0) 20 8144 9872*
>
> On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra 
> wrote:
>
>> Hi,
>>
>> we are using 2.0.14. We have 2 DCs at remote locations with 10GBps
>> connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only
>> one node in DC2, we are unable to complete repair as it always hangs. The
>> node sends Merkle tree requests, but one or more nodes in DC1 (remote) never
>> show that they sent the Merkle tree reply to the requesting node.
>> Repair hangs infinitely.
>>
>> After increasing request_timeout_in_ms on the affected node, we were able to
>> successfully run repair on one of the two occasions.
>>
>> Any comments on why this is happening on just one node? In
>> OutboundTcpConnection.java, the isTimeOut method always returns false for a
>> non-droppable verb such as Merkle Tree Request (verb=REPAIR_MESSAGE), so why
>> did increasing the request timeout solve the problem on one occasion?
>>
>>
>> Thanks
>> Anuj Wadehra
>>
>>
>>
>> On Thursday, 12 November 2015 2:35 AM, Anuj Wadehra <
>> anujw_2...@yahoo.co.in> wrote:
>>
>>
>> Hi,
>>
>> We have 2 DCs at remote locations with 10GBps connectivity.We are able to
>> complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are
>> unable to complete repair as it always hangs. Node sends Merkle Tree
>> requests, but one or more nodes in DC1 (remote) never show that they sent
>> the merkle tree reply to requesting node.
>> Repair hangs infinitely.
>>
>> After increasing request_timeout_in_ms on affected node, we were able to
>> successfully run repair on one of the two occassions.
>>
>> Any comments, why this is happening on just one node? In
>> OutboundTcpConnection.java,  when isTimeOut method always returns false for
>> non-droppable verb such as Merkle Tree Request(verb=REPAIR_MESSAGE),why
>> increasing request timeout solved problem on one occasion ?
>>
>>
>> Thanks
>> Anuj Wadehra
>>
>>
>>
>


Re: Too many open files Cassandra 2.1.11.872

2015-11-06 Thread Bryan Cheng
Is your compaction progressing as expected? If not, this may cause an
excessive number of tiny db files. Had a node refuse to start recently
because of this, had to temporarily remove limits on that process.

On Fri, Nov 6, 2015 at 10:09 AM, Jason Lewis  wrote:

> I'm getting too many open files errors and I'm wondering what the
> cause may be.
>
> lsof -n | grep java shows 1.4M files
>
> ~90k are inodes
> ~70k are pipes
> ~500k are cassandra services in /usr
> ~700K are the data files.
>
> What might be causing so many files to be open?
>
> jas
>


Re: Insertion Delay Cassandra 2.1.9

2015-11-06 Thread Bryan Cheng
Your experience, then, is expected (although 20m delay seems excessive, and
is a sign you may be overloading your cluster, which may be expected with
an unthrottled bulk load like that).

When you insert with consistency ONE on RF > 1, that means your query
returns after one node confirms the write. The write will attempt to go out
to the other nodes that are responsible for that row, but the coordinator
does not bother waiting for the response. If your nodes are overloaded,
they may not accept the write at all; failures may result in hinted handoff
being used, or just the write being dropped in general.

At the end of your load, you likely have nodes missing writes. Look for
dropped MUTATION messages in your nodetool tpstats. For operations that
cannot tolerate this, you need to write and read with a higher consistency
level.
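
For example (QUORUM shown just as an illustration; pick whatever your
application actually needs):

# look at the "Dropped" section at the bottom of the output on each node
nodetool tpstats

-- in cqlsh, applies to the rest of the session
CONSISTENCY QUORUM;

(In the drivers you'd set the equivalent consistency level on the statement
or the session instead.)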

Consistency is achieved over time via hinted handoff, read repair, and
other mechanics (assuming you're not running a repair in between). Your
cluster will gradually return to consistency, *provided your nodes do not
suffer any downtime or exceed the hint window in terms of unavailability*.



On Fri, Nov 6, 2015 at 10:58 AM, Greg Traub  wrote:

> Vidur,
>
> Forgive me if I'm getting this wrong as I'm exceptionally new to Cassandra.
>
> By consistency, if you mean the USING CONSISTENCY clause, then I'm not
> specifying it which, per the CQL documentation, means a default of ONE.
>
> On Fri, Nov 6, 2015 at 1:49 PM, Vidur Malik  wrote:
>
>> What is your query consistency?
>>
>> On Fri, Nov 6, 2015 at 1:47 PM, Greg Traub 
>> wrote:
>>
>>> Cassandra users,
>>>
>>> I have a 4 node Cassandra cluster set up.  All nodes are in a single
>>> rack and distribution center.  I have a loader program which loads 40
>>> million rows into a table in a keyspace with a replication factor of 3.
>>> Immediately after inserting the rows (after the loader program finishes),
>>> if I SELECT count(*) from the table, the result is less than 40 million.
>>> If I run our dumper program to retrieve all rows, it is less than 40
>>> million.  However, if I wait roughly 20 minutes, the count eventually
>>> reaches 40 million rows and the dumper program returns all 40 million.
>>>
>>> If I do the same thing in a keyspace where the replication factor is 1,
>>> I don't have any "stabilization" time and the 40 million rows are
>>> immediately available.
>>>
>>> I've modified the loading and dumping programs to use both the Thrift
>>> Java driver and the CQL Java driver and neither seems to make a difference.
>>>
>>> I'm very new to Cassandra and my questions are, what may be causing this
>>> delay in all rows being available and how might I lessen/eliminate this
>>> delay?
>>>
>>> Thanks,
>>> Greg
>>>
>>
>>
>>
>> --
>>
>> Vidur Malik
>>
>> ShopKeep
>>
>> 800.820.9814
>>
>
>


What are the repercussions of a restart during anticompaction?

2015-11-05 Thread Bryan Cheng
Hey list,

Tried to find an answer to this elsewhere, but turned up nothing.

We ran our first incremental repair after a large dc migration two days
ago; the cluster had been running full repairs prior to this during the
migration. Our nodes are currently going through anticompaction, as
expected.

However, two days later, there is little to no apparent progress on this
process. The compaction count does increase, in bursts, but compactionstats
hangs with no response. We're seeing our disk space footprint grow steadily
as well. The number of sstables on disk is reaching high levels.

In the past, when our compactions seem to hang, a restart seems to move
things along; at the very least, it seems to allow JMX to respond. However,
I'm not sure of the repercussions of a restart during anticompaction.

Given my understanding of anticompaction, my expectation would be that the
sstables that had been split and marked repaired would remain that way, the
ones that had not yet been split would be left as unrepaired and some
ranges would probably be re-repaired on the next incremental repair, and
the machine would do standard compaction among the two sets (repaired vs
unrepaired). In other words, we wouldn't lose any progress in incremental
repair + anticompaction, but some repaired data would get re-repaired. Does
this seem reasonable?

Should I just let this anticompaction run its course? We did the migration
procedure (marking sstables as repaired) awhile ago, but did a full repair
again after that before we decommissioned our old dc.

Any guidance would be appreciated! Thanks,

Bryan


Re: Two node cassandra cluster doubts

2015-11-04 Thread Bryan Cheng
I believe what's going on here is this step:


Select Count (*) From MYTABLE;---> 15 rows

Shut down Node B.

Start Up Node B.

Select Count (*) From MYTABLE;---> 15 rows


To understand why this is an issue, consider the way that consistency is
attempted within Cassandra. With RF=2 (you should really use an odd-numbered
RF and LOCAL_QUORUM so you can tolerate a node failure, but that's another
thing), your write is hitting Node B and being queued for writing to Node
A via a process called hinted handoff. Normally, this handoff occurs when
Node A returns to the cluster, up to max_hint_window_in_ms later, causing
all writes it missed to be replayed and integrated. However, since Node B
also goes down during this time period, it loses the queued hints and
therefore Node A never gets that write.

You may see this flip flopping due to your query hitting Node A and Node B
alternately (you can use trace to verify this).
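
e.g. in cqlsh:

TRACING ON;
SELECT count(*) FROM MYTABLE;
-- the trace output names the coordinator and the replica(s) that answered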

Keep in mind that due to Cassandra's architecture, missing writes will
result in inconsistent data. There are mechanisms to help mitigate this,
for example the aforementioned hinted handoff, or read repair. However, at
the end of the day the only way to ensure consistent data is a repair.
These mechanisms cannot operate reliably if the entire cluster goes down-
which happens in your scenario between the above steps.



On Mon, Nov 2, 2015 at 12:46 PM, Luis Miguel  wrote:

> Thanks for your answer!
> I thought that bootstrapping is executed only when you add a node to the
> cluster for the first time; after that, I thought that gossip is the method
> used to discover the cluster members again. In my case I thought that it was
> more about a read repair issue... am I wrong?
>
> --
> Date: Mon, 2 Nov 2015 21:12:20 +0100
> Subject: Re: FW: Two node cassandra cluster doubts
> From: ichi.s...@gmail.com
> To: user@cassandra.apache.org
>
>
> I think that this is normal behaviour, as you shut down your seed and
> then reboot it. You should know that when you start a seed node it doesn't
> do the bootstrapping thing, which means it doesn't check whether there are
> changes in the contents of the tables. Here in your tests, you shut down
> node A before doing the inserts and started it after. So your node A doesn't
> have the new rows you inserted. And yes, it is normal to get different
> values for your query each time, because the coordinator node changes and
> therefore the query is executed each time on a different node (when node
> B answers you've got 15 rows and when node A does you have 10 rows).
> On 2 Nov 2015 at 19:22, "Luis Miguel"  wrote:
>
> Hello!
>
> I have set a cassandra cluster with two nodes, Node A  and Node B --> RF=2,
> Read CL=1 and Write CL = 1;
>
> Node A is seed...
>
>
> At first everything is working well: when I add/delete/update entries on
> Node A, everything is replicated on Node B and vice-versa, even if I shut
> down Node A, make new insertions on Node B meanwhile, and after that
> start up Node A again - Cassandra recovers OK. BUT there is ONE case where
> this fails. I am going to describe the process:
>
> Node A and Node B are sync.
>
> Select Count (*) From MYTABLE;---> 10 rows
>
> Shut down Node A.
>
> Made some inserts on Node B.
>
> Select Count (*) From MYTABLE;---> 15 rows
>
> Shut down Node B.
>
> Start Up Node B.
>
> Select Count (*) From MYTABLE;---> 15 rows
>
> (Everything Ok, yet).
>
> Start Up Node A.
>
> Select Count (*) From MYTABLE;---> 10 rows (uhmmm...this is weird...check
> it again)
> Select Count (*) From MYTABLE;---> 15 rows  (wow!..this is correct, lets
> try again)
> Select Count (*) From MYTABLE;---> 10 rows (Ok...values are dancing)
>
> If I make the same queries on NODE B it behaves the same way, and it
> is only solved with a nodetool repair... but I would prefer an automatic
> fail-over...
>
> is there any way to avoid this??? or is a nodetool repair execution
> mandatory???
>
> Thanks in advance!!!
>
>


Re: Doubt regarding consistency-level in Cassandra-2.1.10

2015-11-03 Thread Bryan Cheng
What Eric means is that SERIAL consistency is a special type of consistency
that is only invoked for a subset of operations: those that use
CAS/lightweight transactions, for example "IF NOT EXISTS" queries.

The differences between CAS operations and standard operations are
significant, and there are large repercussions for tunable consistency. The
amount of time such an operation takes is greatly increased as well; you
may need to increase your internal node-to-node timeouts.
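As a purely illustrative example (the table and values are made up),
statements like these are lightweight transactions; their Paxos phase runs at
the serial consistency level (SERIAL by default), separately from the regular
write consistency level you set on the statement:

    -- the IF NOT EXISTS clause turns this into a CAS operation
    INSERT INTO users (id, email) VALUES (42, 'a@example.com') IF NOT EXISTS;

    -- conditional updates are CAS operations as well
    UPDATE users SET email = 'b@example.com' WHERE id = 42 IF email = 'a@example.com';

If you don't actually need the compare-and-set semantics, dropping the IF
clause avoids the SERIAL round trips entirely.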

On Mon, Nov 2, 2015 at 8:01 PM, Ajay Garg  wrote:

> Hi Eric,
>
> I am sorry, but I don't understand.
>
> If there had been some issue in the configuration, then the
> consistency issue would be seen every time (I guess).
> As of now, the error is seen sometimes (probably 30% of the time).
>
> On Mon, Nov 2, 2015 at 10:24 PM, Eric Stevens  wrote:
>
>> Serial consistency gets invoked at the protocol level when doing
>> lightweight transactions such as CAS operations.  If you're expecting that
>> your topology is RF=2, N=2, it seems like some keyspace has RF=3, and so
>> there aren't enough nodes available to satisfy serial consistency.
>>
>> See
>> http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_ltwt_transaction_c.html
>>
>> On Mon, Nov 2, 2015 at 1:29 AM Ajay Garg  wrote:
>>
>>> Hi All.
>>>
>>> I have a 2*2 Network-Topology Replication setup, and I run my
>>> application via DataStax-driver.
>>>
>>> I frequently get the errors of type ::
>>> *Cassandra timeout during write query at consistency SERIAL (3 replica
>>> were required but only 0 acknowledged the write)*
>>>
>>> I have already tried passing a "write-options with LOCAL_QUORUM
>>> consistency-level" in all create/save statements, but I still get this
>>> error.
>>>
>>> Does something else need to be changed in /etc/cassandra/cassandra.yaml
>>> too?
>>> Or may be some another place?
>>>
>>>
>>> --
>>> Regards,
>>> Ajay
>>>
>>
>
>
> --
> Regards,
> Ajay
>


Re: Maximum node decommission // bootstrap at once.

2015-10-06 Thread Bryan Cheng
Honestly, we've had more luck bootstrapping in our old DC (defining
topology properties as the new DC) and using rsync to migrate the data
files to new machines in the new datacenter. We had 10gig within the
datacenter but significantly less than this cross-DC, which led to a lot
of broken streaming pipes and wasted effort. This might make sense
depending on your link quality and the resources/time you have available to
do TCP tuning.
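If you do go the cross-DC streaming route, the usual knobs are the kernel TCP
buffer sizes, something along these lines (values are illustrative only; size
them for your actual bandwidth-delay product):

    # /etc/sysctl.conf - example values, not a recommendation
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216

    # apply without a reboot
    sysctl -p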

On Tue, Oct 6, 2015 at 1:29 PM, Kevin Burton  wrote:

> I'm not sure which is faster/easier.  Just joining one box at a time and
> then decommissioning or using replace_address.
>
> this stuff is always something you do rarely, and so it's more complex than
> it needs to be.
>
> This complicates long term migration too.  Having to have gigabit is
> somewhat of a problem in that you might not actually have it where you're
> going.
>
> We're migrating from Washington, DC to Germany so we have to change TCP
> send/receive buffers to get decent bandwidth.
>
> But I think we can do this at 1Gb per so per box.
>
>
> On Tue, Oct 6, 2015 at 12:48 PM, Robert Coli  wrote:
>
>> On Tue, Oct 6, 2015 at 12:32 PM, Kevin Burton  wrote:
>>
>>> How many nodes can we bootstrap at once?  How many can we decommission?
>>>
>>
>> short answer : only 1 node can join or part simultaneously
>>
>> longer answer : https://issues.apache.org/jira/browse/CASSANDRA-2434 /
>> https://issues.apache.org/jira/browse/CASSANDRA-7069 /
>> -Dconsistent.rangemovement
>>
>> Have you considered using replace_address to replace your existing 13
>> nodes, at which point you just have to join 17 more?
>>
>> =Rob
>>
>>
>
>
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Maximum node decommission // bootstrap at once.

2015-10-06 Thread Bryan Cheng
Robert, I might be misinterpreting you but I *think* your link is talking
about bootstrapping a new node by bulk loading replica data from your
existing cluster?

I was referring to using Cassandra's bootstrap to get the node to join and
run (as a member of DC2 but with physical residence in DC1), and then
transfer the /data directory to a new machine to assume the identity of the
old. I *believe* that from the cluster point of view this is just the node
being down for an extended period of time (so the usual caveats apply).
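Very roughly, the mechanics look like this (a sketch only; paths assume the
default package layout, adjust to your install):

    # on the temporary node in DC1, once it has finished bootstrapping as DC2
    nodetool drain                 # flush memtables, stop accepting writes
    sudo service cassandra stop

    # copy the data directory (including the system keyspace, which carries the
    # node's host ID and tokens) to the machine that will take over its identity
    rsync -avz /var/lib/cassandra/data/ newhost:/var/lib/cassandra/data/

    # on the new machine: same cassandra.yaml (cluster name, snitch/DC settings),
    # then start it up; to the rest of the cluster it just looks like the node
    # was down for a while
    sudo service cassandra start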

On Tue, Oct 6, 2015 at 2:20 PM, Robert Coli  wrote:

> On Tue, Oct 6, 2015 at 2:14 PM, Kevin Burton  wrote:
>
>> Plan B: we will just rsync the data... Does it pretty much work just by
>> putting the data in a directory, or do you have to do anything special?
>>
>
> http://www.pythian.com/blog/bulk-loading-options-for-cassandra/
>
> Be careful, with vnodes the rsync approach gets meaningfully harder.
>
> =Rob
>


Re: broadcast address on EC2 without Elastic IPs.

2015-10-01 Thread Bryan Cheng
Hey Renato,

As far as I can tell, the reason you're getting private IP addresses back
is that the node you're connecting to is relaying back the way that _it_
knows where to find other nodes, which is a function of the gossip state.
This is expected behavior.

Mixed Private/Public IP spaces without full connectivity between both sets
of address spaces is always going to be a pain, IMHO; you're much better
off standardizing on one or the other.

It sounds like you have a machine (maybe a dev machine?) outside of EC2
trying to reach your cluster. If this is just for development, then the
"easiest" mechanism would be to standardize on public IPs; elastic IP's are
free as long as they're in use, so I would just request a quota increase.
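If you standardize on public IPs, the per-node settings would look roughly
like this (addresses below are placeholders):

    # cassandra.yaml - illustrative only
    listen_address: 10.0.1.5          # this node's private IP (what it binds to)
    broadcast_address: 54.0.0.5       # this node's public/elastic IP (what it gossips)
    rpc_address: 0.0.0.0
    broadcast_rpc_address: 54.0.0.5   # what drivers are told to connect to
    # the seeds list should use the public IPs as well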

If this is a production configuration (maybe another datacenter?) you'll
probably want to investigate a more robust routing solution. We have two
datacenters with distinct, non-intersecting Private IP spaces; we use a VPN
to route between them. Clients on both sides of the tunnel can natively
speak the private IP's on the other side, which eliminates odd issues from
NAT. From what I've seen on the list, this is a somewhat common
configuration.

Hope this helps!

--Bryan

On Wed, Sep 30, 2015 at 7:24 AM, Renato Perini 
wrote:

> Hello!
> I have configured a small cluster composed of three nodes on Amazon EC2.
> The 3 machines don't have an elastic IP (static address) so the public
> address changes at every reboot.
>
> I have a machine with a static ip that I use as a bridge to access the
> other 3 cassandra nodes through SSH. On this machine, I have set up a
> tunnel towards the first node of the cluster in order to open the 9042
> port and let me access the cluster through this static IP.
>
> Basically, my cassandra.yaml has these settings:
> listen_address: private IP
> broadcast_address: commented out.
> rpc_address: 0.0.0.0
> broadcast_rpc_address: private ip
>
> I know I should set the broadcast address to the public IP, but it is
> dynamic and I don't have any idea at the moment how I could determine it
> and set it in the cassandra.yaml file.
>
> I'm developing a small client using the datastax connector (in Java).
> I set up the contact point using the public IP of the bridge machine. The
> client connects but gives some errors while adding other nodes in the
> cluster:
>
> 15:43:26,887 ERROR [com.datastax.driver.core.Session]
> (cluster1-nio-worker-1) Error creating pool to /XXX.XX.XX.XXX:9042:
> com.datastax.driver.core.TransportException: [/XXX.XX.XX.XXX:9042] Cannot
> connect
> at
> com.datastax.driver.core.Connection$1.operationComplete(Connection.java:156)
> [cassandra-driver-core-2.2.0-rc3.jar:]
> at
> com.datastax.driver.core.Connection$1.operationComplete(Connection.java:139)
> [cassandra-driver-core-2.2.0-rc3.jar:]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:214)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> [netty-common-4.0.27.Final.jar:4.0.27.Final]
> at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_80]
> Caused by: io.netty.channel.ConnectTimeoutException: connection timed out:
> /XXX.XX.XX.XXX:9042
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:212)
> [netty-transport-4.0.27.Final.jar:4.0.27.Final]
> ... 6 more
>
> 15:43:26,887 ERROR [com.datastax.driver.core.Session]
> (cluster1-nio-worker-3) Error creating pool to /XXX.XX.XX.XX:9042:
> com.datastax.driver.core.TransportException: [/XXX.XX.XX.XX:9042] Cannot
> connect
> at
> com.datastax.driver.core.Connection$1.operationComplete(Connection.java:156)
> [cassandra-driver-core-2.2.0-rc3.jar:]
> at
> 

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

2015-09-08 Thread Bryan Cheng
Tom, I don't believe so; it seems the symptom would be an indefinite (or
very long) hang.

To clarify, is this issue restricted to LOCAL_QUORUM? Can you issue a
LOCAL_ONE SELECT and retrieve the expected data back?
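Something quick to try in cqlsh against a DC1 node (the column values are
placeholders; adjust to match your schema):

    CONSISTENCY LOCAL_ONE;
    SELECT * FROM aggregate WHERE type = 'sometype' AND typeid = 'someid';

    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM aggregate WHERE type = 'sometype' AND typeid = 'someid';

If LOCAL_ONE reliably returns the row while LOCAL_QUORUM only does so
intermittently, that narrows things down considerably.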

On Tue, Sep 8, 2015 at 12:02 PM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> Just to be sure: can this bug result in a 0-row result while it should be > 0?
> On 8 Sep 2015 at 6:29 PM, "Tyler Hobbs"  wrote:
>
> See https://issues.apache.org/jira/browse/CASSANDRA-9753
>>
>> On Tue, Sep 8, 2015 at 10:22 AM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> I've been bugging you a few times, but now I've got trace data for a
>>> query with LOCAL_QUORUM that is being sent to a remote data center.
>>>
>>> The setup is as follows:
>>> NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
>>> Both DC1 and DC2 have 2 nodes.
>>> In DC2, one node is currently being rebuilt, and therefore does not
>>> contain all data (yet).
>>>
>>> The client app connects to a node in DC1, and sends a SELECT query with
>>> CL LOCAL_QUORUM, which in this case means (1/2)+1 = 1.
>>> If all is ok, the query always produces a result, because the requested
>>> rows are guaranteed to be available in DC1.
>>>
>>> However, the query sometimes produces no result. I've been able to
>>> record the traces of these queries, and it turns out that the coordinator
>>> node in DC1 sometimes sends the query to DC2, to the node that is being
>>> rebuilt, and does not have the requested rows. I've included an example
>>> trace below.
>>>
>>> The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
>>> is in DC2.
>>> I've verified that the  CL=LOCAL_QUORUM by printing it when the query is
>>> sent (I'm using the datastax java driver).
>>>
>>>  activity | source | source_elapsed | thread
>>> ----------------------------------------------------------------------------+--------------+----------------+-----------------------------------------
>>>  Message received from /10.55.156.67 | 10.88.4.194 | 48 | MessagingService-Incoming-/10.55.156.67
>>>  Executing single-partition query on aggregate | 10.88.4.194 | 286 | SharedPool-Worker-2
>>>  Acquiring sstable references | 10.88.4.194 | 306 | SharedPool-Worker-2
>>>  Merging memtable tombstones | 10.88.4.194 | 321 | SharedPool-Worker-2
>>>  Partition index lookup allows skipping sstable 107 | 10.88.4.194 | 458 | SharedPool-Worker-2
>>>  Bloom filter allows skipping sstable 1 | 10.88.4.194 | 489 | SharedPool-Worker-2
>>>  Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones | 10.88.4.194 | 496 | SharedPool-Worker-2
>>>  Merging data from memtables and 0 sstables | 10.88.4.194 | 500 | SharedPool-Worker-2
>>>  Read 0 live and 0 tombstone cells | 10.88.4.194 | 513 | SharedPool-Worker-2
>>>  Enqueuing response to /10.55.156.67 | 10.88.4.194 | 613 | SharedPool-Worker-2
>>>  Sending message to /10.55.156.67 | 10.88.4.194 | 672 | MessagingService-Outgoing-/10.55.156.67
>>>  Parsing SELECT * FROM Aggregate WHERE type=? AND typeId=?; | 10.55.156.67 | 10 | SharedPool-Worker-4
>>>  Sending message to /10.88.4.194 | 10.55.156.67 | 4335 | MessagingService-Outgoing-/10.88.4.194
>>>  Message received from /10.88.4.194 | 10.55.156.67 | 6328 | MessagingService-Incoming-/10.88.4.194
>>>  Seeking to partition beginning in data file | 10.55.156.67 | 10417 | SharedPool-Worker-3
>>>  Key cache hit for sstable 389 | 10.55.156.67 | 10586 | SharedPool-Worker-3
>>>
>>> My question is: how is it possible that the query is sent to a node in
>>> DC2?
>>> Since DC1 has 2 nodes and RF 1, the query should always be sent to the
>>> other node in DC1 if the coordinator does not have a replica, right?
>>>
>>> Thanks,
>>> Tom
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>


Re: How to prevent queries being routed to new DC?

2015-09-03 Thread Bryan Cheng
Hey Tom,

What's your replication strategy look like? When your new nodes join the
ring, can you verify that they show up under a new DC and not as part of
the old?

--Bryan

On Thu, Sep 3, 2015 at 11:27 AM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> I want to start using vnodes in my cluster. To do so, I've set up a new
> data center with the same number of nodes as the existing one, as described
> in
> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configVnodesProduction_t.html.
> The new DC is in the same physical location as the old one.
>
> The problem I'm running into is that as soon as the nodes in the new data
> center are started, the application that is using the nodes in the old data
> center is frequently getting error messages because queries don't return
> the expected data. I'm pretty sure this is because somehow these queries
> are routed to the new, empty data center. The application is not connecting
> to the nodes in the new DC.
>
> I've tried two different things to prevent this:
>
> 1) Ensure that all queries use either LOCAL_ONE or LOCAL_QUORUM
> consistency. Nevertheless, I'm still seeing failed queries.
> 2) Start the new nodes with -Dcassandra.join_ring=false, to prevent them
> from participating in the cluster. Although they don't show up in nodetool
> ring, I'm still seeing failed queries.
>
> If I understand it correctly, both measures should prevent queries from
> ending up in the new DC, but somehow they don't in my situation.
>
> How is it possible that queries are routed to the new, empty data center?
> And more importantly, how can I prevent it?
>
> Thanks,
> Tom
>


Re: How to prevent queries being routed to new DC?

2015-09-03 Thread Bryan Cheng
This all seems fine so far. Are you able to see what errors are being
returned?

We had a similar issue where one of our secondary, less-used keyspaces was
on a replication strategy that was not DC-aware, which was causing errors
about being unable to satisfy LOCAL_ONE and LOCAL_QUORUM consistency levels.
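It's worth double-checking every keyspace, including the small auxiliary ones,
e.g. in cqlsh (keyspace name is a placeholder):

    -- shows the replication class and per-DC factors for one keyspace
    DESCRIBE KEYSPACE mykeyspace;

    -- or list them all at once on C* 2.x
    SELECT keyspace_name, strategy_class, strategy_options FROM system.schema_keyspaces;

Anything still on SimpleStrategy places replicas without regard to DC, which
is what bit us.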


On Thu, Sep 3, 2015 at 11:53 AM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> Hi Bryan,
>
> I'm using the PropertyFileSnitch, and it contains entries for all nodes in
> the old DC, and all nodes in the new DC. The replication factor for both
> DCs is 1.
>
> With the first approach I described, the new nodes join the cluster, and
> show up correctly under the new DC, so all seems to be fine.
> With the second approach (join_ring=false), they don't show up at all,
> which is also what I expected.
>
>
> On Thu, Sep 3, 2015 at 8:44 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> Hey Tom,
>>
>> What's your replication strategy look like? When your new nodes join the
>> ring, can you verify that they show up under a new DC and not as part of
>> the old?
>>
>> --Bryan
>>
>> On Thu, Sep 3, 2015 at 11:27 AM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> I want to start using vnodes in my cluster. To do so, I've set up a new
>>> data center with the same number of nodes as the existing one, as described
>>> in
>>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configVnodesProduction_t.html.
>>> The new DC is in the same physical location as the old one.
>>>
>>> The problem I'm running into is that as soon as the nodes in the new
>>> data center are started, the application that is using the nodes in the old
>>> data center is frequently getting error messages because queries don't
>>> return the expected data. I'm pretty sure this is because somehow these
>>> queries are routed to the new, empty data center. The application is not
>>> connecting to the nodes in the new DC.
>>>
>>> I've tried two different things to prevent this:
>>>
>>> 1) Ensure that all queries use either LOCAL_ONE or LOCAL_QUORUM
>>> consistency. Nevertheless, I'm still seeing failed queries.
>>> 2) Start the new nodes with -Dcassandra.join_ring=false, to prevent them
>>> from participating in the cluster. Although they don't show up in nodetool
>>> ring, I'm still seeing failed queries.
>>>
>>> If I understand it correctly, both measures should prevent queries from
>>> ending up in the new DC, but somehow they don't in my situation.
>>>
>>> How is it possible that queries are routed to the new, empty data
>>> center? And more importantly, how can I prevent it?
>>>
>>> Thanks,
>>> Tom
>>>
>>
>>
>


Re: How to prevent queries being routed to new DC?

2015-09-03 Thread Bryan Cheng
Hey Tom,

I'd recommend you enable tracing and do a few queries in a controlled
environment to verify whether queries are being routed to your new nodes.
Provided you have followed the procedure outlined above (specifically, have
set auto_bootstrap to false on the new DC's nodes), rebuild has not been run,
the application is not connecting to the new DC, and all your queries
are run at LOCAL_* consistency levels, I do not believe those queries should
be routed to the new DC.
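For example, in cqlsh connected to one of the old-DC nodes (the query is
illustrative):

    TRACING ON;
    CONSISTENCY LOCAL_QUORUM;
    SELECT * FROM mykeyspace.mytable WHERE id = 'some-key';
    TRACING OFF;

Every step in the trace output is tagged with a source IP, so any activity on
a new-DC address will show up immediately.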

On Thu, Sep 3, 2015 at 12:14 PM, Tom van den Berge <
tom.vandenbe...@gmail.com> wrote:

> Hi Bryan,
>
> It does not generate any errors. A query for a specific row simply does
> not return the row if it is sent to a node in the new DC. This makes sense,
> because the node is still empty.
>
> On Thu, Sep 3, 2015 at 9:03 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>
>> This all seems fine so far. Are you able to see what errors are being
>> returned?
>>
>> We had a similar issue where one of our secondary, less used keyspaces
>> was on a replication strategy that was not DC-aware, which was causing
>> errors about being unable to satisfy LOCAL_ONE and LOCAL_QUORUM consistency
>> levels.
>>
>>
>> On Thu, Sep 3, 2015 at 11:53 AM, Tom van den Berge <
>> tom.vandenbe...@gmail.com> wrote:
>>
>>> Hi Bryan,
>>>
>>> I'm using the PropertyFileSnitch, and it contains entries for all nodes
>>> in the old DC, and all nodes in the new DC. The replication factor for both
>>> DCs is 1.
>>>
>>> With the first approach I described, the new nodes join the cluster, and
>>> show up correctly under the new DC, so all seems to be fine.
>>> With the second approach (join_ring=false), they don't show up at all,
>>> which is also what I expected.
>>>
>>>
>>> On Thu, Sep 3, 2015 at 8:44 PM, Bryan Cheng <br...@blockcypher.com>
>>> wrote:
>>>
>>>> Hey Tom,
>>>>
>>>> What's your replication strategy look like? When your new nodes join
>>>> the ring, can you verify that they show up under a new DC and not as part
>>>> of the old?
>>>>
>>>> --Bryan
>>>>
>>>> On Thu, Sep 3, 2015 at 11:27 AM, Tom van den Berge <
>>>> tom.vandenbe...@gmail.com> wrote:
>>>>
>>>>> I want to start using vnodes in my cluster. To do so, I've set up a
>>>>> new data center with the same number of nodes as the existing one, as
>>>>> described in
>>>>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configVnodesProduction_t.html.
>>>>> The new DC is in the same physical location as the old one.
>>>>>
>>>>> The problem I'm running into is that as soon as the nodes in the new
>>>>> data center are started, the application that is using the nodes in the 
>>>>> old
>>>>> data center is frequently getting error messages because queries don't
>>>>> return the expected data. I'm pretty sure this is because somehow these
>>>>> queries are routed to the new, empty data center. The application is not
>>>>> connecting to the nodes in the new DC.
>>>>>
>>>>> I've tried two different things to prevent this:
>>>>>
>>>>> 1) Ensure that all queries use either LOCAL_ONE or LOCAL_QUORUM
>>>>> consistency. Nevertheless, I'm still seeing failed queries.
>>>>> 2) Start the new nodes with -Dcassandra.join_ring=false, to prevent
>>>>> them from participating in the cluster. Although they don't show up in
>>>>> nodetool ring, I'm still seeing failed queries.
>>>>>
>>>>> If I understand it correctly, both measures should prevent queries
>>>>> from ending up in the new DC, but somehow they don't in my situation.
>>>>>
>>>>> How is it possible that queries are routed to the new, empty data
>>>>> center? And more importantly, how can I prevent it?
>>>>>
>>>>> Thanks,
>>>>> Tom
>>>>>
>>>>
>>>>
>>>
>>
>


Rebuild new DC nodes against new DC?

2015-08-31 Thread Bryan Cheng
Hi list,

We're bringing up a second DC, and following the procedure outlined here:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html

We have three nodes in the new DC that are members of the cluster and
indicate that they are running normally. We have begun the process of
altering the keyspaces for multi-DC and are streaming over data via
nodetool rebuild on a keyspace-by-keyspace basis.

I couldn't find a clear answer for this: at what point is it safe to
rebuild from the new DC versus the old?

In other words, I have machines a, b, and c in DC2 (the new DC). I build a
and b by specifying DC1 on the rebuild command line. Can I safely rebuild
against DC2 for machine c? Is this at all dependent on quorum settings?
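For reference, what we're running looks roughly like this (keyspace name and
replication factors are just examples):

    -- run once, from any node
    ALTER KEYSPACE mykeyspace WITH replication =
      {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

    # on each new node in DC2, streaming from the original DC
    nodetool rebuild -- DC1

    # the question: once a and b are built, is this safe for machine c?
    nodetool rebuild -- DC2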

Our DCs are linked by a VPN that doesn't have as big of a pipe as we'd
like; streaming within the new DC would make things faster and ease some
headaches.

Thanks for any help!

--Bryan


Re: Incremental, Sequential repair?

2015-08-25 Thread Bryan Cheng
Thanks Robert! To clarify, you're referring to the process of using
sstablerepairedset to mark sstables as repaired after a full repair with
autocompaction off? We're in the process of doing that throughout our
cluster now.
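For anyone following along, the per-node steps we're using look roughly like
this (keyspace/table names and paths are examples; see the blog post above for
the authoritative procedure):

    nodetool disableautocompaction mykeyspace mytable
    nodetool repair mykeyspace mytable        # full repair first
    sudo service cassandra stop
    # mark the existing sstables as repaired
    sstablerepairedset --really-set --is-repaired /var/lib/cassandra/data/mykeyspace/mytable-*/*Data.db
    sudo service cassandra start
    nodetool enableautocompaction mykeyspace mytable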

On Tue, Aug 25, 2015 at 3:30 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Aug 25, 2015 at 2:44 PM, Bryan Cheng br...@blockcypher.com
 wrote:

 [2015-08-25 21:36:43,433] It is not possible to mix sequential repair and
 incremental repairs.

 Is this a limitation around a specific configuration? Or is it generally
 true that incremental and sequential repairs are not compatible?


 There's a migration process to incremental repairs.

 http://www.datastax.com/dev/blog/more-efficient-repairs

 etc.

 =Rob



Incremental, Sequential repair?

2015-08-25 Thread Bryan Cheng
Hey all,

Got a question about incremental repairs; a quick Google search turned up
nothing conclusive.

In a few places, the docs mention combining sequential and incremental repairs.

From
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
(indirectly):

 You can combine repair options, such as parallel and incremental repair.

From http://www.datastax.com/dev/blog/more-efficient-repairs:

 Incremental repairs can be opted into via the -inc option to nodetool
repair. This is compatible with both sequential and parallel (-par) repair

However, when I try to run an incremental, sequential repair (nodetool
repair -inc), I get:

[2015-08-25 21:36:43,433] It is not possible to mix sequential repair and
incremental repairs.

Is this a limitation around a specific configuration? Or is it generally
true that incremental and sequential repairs are not compatible?

The cluster is a mix of 2.1.8 and 2.1.7, replication is NetworkTopology, with
LeveledCompaction (if it's relevant).

Thanks in advance!


Re: Change from single region EC2 to multi-region

2015-08-11 Thread Bryan Cheng
Setting broadcast_address to the public IP should be the correct configuration.
Assuming your firewall rules are all kosher, you may need to clear gossip
state?
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_gossip_purge.html
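If it comes to that, the short version of the linked doc is to stop the node
and bring it up once with its saved ring state discarded, roughly (service
names and paths depend on your install):

    sudo service cassandra stop
    # temporarily add this to cassandra-env.sh, then remove it after one clean start
    JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
    sudo service cassandra start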

-- Forwarded message --
From: Asher Newcomer asher...@gmail.com
Date: Tue, Aug 11, 2015 at 11:51 AM
Subject: Change from single region EC2 to multi-region
To: user@cassandra.apache.org


X-post w/ SO: link
https://stackoverflow.com/questions/31949043/cassandra-change-from-single-region-ec2-to-multi-region

I have (had) a working 4 node Cassandra cluster setup in an EC2 VPC. Setup
was as follows:

172.18.100.110 - seed - DC1 / RAC1

172.18.100.111 - DC1 / RAC1

172.18.100.112 - seed - DC1 / RAC2

172.18.100.113 - DC1 / RAC2

All of the above nodes are in East-1D, and I have configured it using the
GossipingPropertyFileSnitch (I would rather not use the EC2 specific
snitches).

listen_address and broadcast_address were both set to the node's private IP.

I then wanted to expand the cluster into a new region (us-west). Because
cross-region private IP communication is not supported in EC2, I attempted
to change the settings to have the nodes communicate through their public
IPs.

listen_address remained set to private IP
broadcast_address was changed to the public IP
seeds_list IPs were changed to the appropriate public IPs

I restarted the nodes one by one expecting them to simply 'work', but now
they only see themselves and not the other nodes.

nodetool status consistently returns:

Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
DN 172.18.100.112 ? 256 ? 968aaa8a-32b7-4493-9747-3df1c3784164 r1
DN 172.18.100.113 ? 256 ? 8e03643c-9db8-4906-aabc-0a8f4f5c087d r1
UN [public IP of local node] 75.91 GB 256 ?
6fdcc85d-6c78-46f2-b41f-abfe1c86ac69 RAC1
DN 172.18.100.110 ? 256 ? fb7b78a8-d1cc-46fe-ab18-f0d3075cb426 r1

On each individual node, the other nodes seem 'stuck' using the private IP
addresses.

*How do I force the nodes to look for each other at their public addresses?*

I have fully opened the EC2 security group/firewall as a test to rule out
any problems there - and it hasn't helped.

Any ideas most appreciated.


Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Hi there,

Within our Cassandra cluster, we're observing, on occasion, one or two
nodes at a time becoming partially unresponsive.

We're running 2.1.7 across the entire cluster.

nodetool still reports the node as being healthy, and it does respond to
some local queries; however, the CPU is pegged at 100%. One common thread
(heh) each time this happens is that there always seems to be one or more
compaction threads running (via nodetool tpstats), and some appear to be
stuck (active count doesn't change, pending count doesn't decrease). A
request for compactionstats hangs with no response.

Each time we've seen this, the only thing that appears to resolve the issue
is a restart of the Cassandra process; the restart does not appear to be
clean, and requires one or more attempts (or a -9 on occasion).

There does not seem to be any pattern to what machines are affected; the
nodes thus far have been different instances on different physical machines
and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what
data can we collect that would help with the debugging process (in addition
to tpstats)?

Thanks in advance,

Bryan


Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Robert, thanks for these references! We're not using DTCS, so 9056 and 8243
seem out, but I'll take a look at 9577 (also looked at the referenced
thread on this list, which seems to have some interesting data)

On Wed, Jul 22, 2015 at 5:33 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com
 wrote:

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.


 I've heard other reports of compaction appearing to stall in 2.1.7...
 wondering if you're affected by any of these...

 https://issues.apache.org/jira/browse/CASSANDRA-9577
 or
 https://issues.apache.org/jira/browse/CASSANDRA-9056 or
 https://issues.apache.org/jira/browse/CASSANDRA-8243 (these should not be
 in 2.1.7)

 =Rob




Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Hi Aiman,

We previously had issues with GC, but since upgrading to 2.1.7 things seem
a lot healthier.

We collect GC statistics through collectd via the garbage collector MBean;
ParNew GCs report sub-500ms collection time on average (I believe
accumulated per minute?) and CMS peaks at about 300ms collection time when
it runs.
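For an ad-hoc spot check on a box that's acting up, something like jstat is
handy as well (pid is the Cassandra process, interval in milliseconds):

    # YGC/YGCT are the young-gen (ParNew) counts and cumulative times,
    # FGC/FGCT roughly correspond to CMS activity
    jstat -gcutil <cassandra-pid> 1000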

On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz ai...@flipagram.com wrote:

 Hi Bryan
 How's GC behaving on these boxes?

 On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com
 wrote:

 Hi there,

 Within our Cassandra cluster, we're observing, on occasion, one or two
 nodes at a time becoming partially unresponsive.

 We're running 2.1.7 across the entire cluster.

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.

 Each time we've seen this, the only thing that appears to resolve the
 issue is a restart of the Cassandra process; the restart does not appear to
 be clean, and requires one or more attempts (or a -9 on occasion).

 There does not seem to be any pattern to what machines are affected; the
 nodes thus far have been different instances on different physical machines
 and on different racks.

 Has anyone seen this before? Alternatively, when this happens again, what
 data can we collect that would help with the debugging process (in addition
 to tpstats)?

 Thanks in advance,

 Bryan




 --
 *Aiman Parvaiz*
 Lead Systems Architect
 ai...@flipagram.com
 cell: 213-300-6377
 http://flipagram.com/apz



Re: Cassandra compaction appears to stall, node becomes partially unresponsive

2015-07-22 Thread Bryan Cheng
Aiman,

Your post made me look back at our data a bit. The most recent occurrence
of this incident was not preceded by any abnormal GC activity; however, the
previous occurrence (which took place a few days ago) did correspond to a
massive, order-of-magnitude increase in both ParNew and CMS collection
times which lasted ~17 hours.

Was there something in particular that links GC to these stalls? At this
point in time, we cannot identify any particular reason for either that GC
spike or the subsequent apparent compaction stall, although it did not seem
to have any effect on our usage of the cluster.

On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng br...@blockcypher.com wrote:

 Hi Aiman,

 We previously had issues with GC, but since upgrading to 2.1.7 things seem
 a lot healthier.

 We collect GC statistics through collectd via the garbage collector mbean,
 ParNew GC's report sub 500ms collection time on average (I believe
 accumulated per minute?) and CMS peaks at about 300ms collection time when
 it runs.

 On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz ai...@flipagram.com
 wrote:

 Hi Bryan
 How's GC behaving on these boxes?

 On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng br...@blockcypher.com
 wrote:

 Hi there,

 Within our Cassandra cluster, we're observing, on occasion, one or two
 nodes at a time becoming partially unresponsive.

 We're running 2.1.7 across the entire cluster.

 nodetool still reports the node as being healthy, and it does respond to
 some local queries; however, the CPU is pegged at 100%. One common thread
 (heh) each time this happens is that there always seems to be one or more
 compaction threads running (via nodetool tpstats), and some appear to be
 stuck (active count doesn't change, pending count doesn't decrease). A
 request for compactionstats hangs with no response.

 Each time we've seen this, the only thing that appears to resolve the
 issue is a restart of the Cassandra process; the restart does not appear to
 be clean, and requires one or more attempts (or a -9 on occasion).

 There does not seem to be any pattern to what machines are affected; the
 nodes thus far have been different instances on different physical machines
 and on different racks.

 Has anyone seen this before? Alternatively, when this happens again,
 what data can we collect that would help with the debugging process (in
 addition to tpstats)?

 Thanks in advance,

 Bryan




 --
 *Aiman Parvaiz*
 Lead Systems Architect
 ai...@flipagram.com
 cell: 213-300-6377
 http://flipagram.com/apz