Re: Streaming Process: How can we speed it up?

2016-09-15 Thread Vasileios Vlachos
Thanks for sharing your experience Ben

On 15 Sep 2016 11:35 am, "Ben Slater" <ben.sla...@instaclustr.com> wrote:

> We’ve successfully used the rsync method you outline quite a few times in
> situations where we’ve had clusters that take forever to add new nodes
> (mainly due to secondary indexes) and need to do a quick replacement for
> one reason or another. As you mention, the main disadvantage we ran into is
> that the node doesn’t get cleaned up through the replacement process like a
> newly streamed node does (plus the extra operational complexity).
>
> Cheers
> Ben
>
> On Thu, 15 Sep 2016 at 19:47 Vasileios Vlachos <vasileiosvlac...@gmail.com>
> wrote:
>
>> Hello and thanks for your responses,
>>
>> OK, so increasing stream_throughput_outbound_megabits_per_sec makes no
>> difference. Any ideas why streaming is limited to only two of the three
>> nodes available?
>>
>> As an alternative to slow streaming I tried this:
>>
>>   - install C* on a new node, stop the service and delete
>> /var/lib/cassandra/*
>>  - rsync /etc/cassandra from old node to new node
>>  - rsync /var/lib/cassandra from old node to new node
>>  - stop C* on the old node
>>  - rsync /var/lib/cassandra from old node to new node
>>  - move the old node to a different IP
>>  - move the new node to the old node's original IP
>>  - start C* on the new node (no need for the replace_node option in
>> cassandra-env.sh)
>>
>> This technique has been successful so far for a demo cluster with less
>> data. The only disadvantage for us is that we were hoping that by streaming
>> the SSTables to the new node, tombstones would be discarded (freeing a lot
>> of disk space on our live cluster). This is exactly what happened for the
>> one node we streamed so far; unfortunately, the slow streaming generates a
>> lot of hints which makes recovery a very long process.
>>
>> Do you guys see any other problems with the rsync method that I've
>> skipped?
>>
>> Regarding the tombstones issue (if we finally do what I described above),
>> I'm thinking sstablesplit. Then compaction should deal with it (I think). I
>> have not used sstablesplit in the past, so another thing I'd like to ask is
>> if you guys find this a good/bad idea for what I'm trying to do.
>>
>> Many thanks,
>> Vasilis
>>
>> On Mon, Sep 12, 2016 at 6:42 PM, Jeff Jirsa <jji...@apache.org> wrote:
>>
>>>
>>>
>>> On 2016-09-12 09:38 (-0700), daemeon reiydelle <daeme...@gmail.com>
>>> wrote:
>>> > Re. throughput. That looks slow for jumbo with 10g. Check your
>>> networks.
>>> >
>>> >
>>>
>>> It's extremely unlikely you'll be able to saturate a 10g link with a
>>> single instance cassandra.
>>>
>>> Faster Cassandra streaming is a work in progress - being able to send
>>> more than one file at a time is probably the most obvious area for
>>> improvement, and being able to better deal with the CPU / garbage generated
>>> on the receiving side is just behind that. You'll likely be able to stream
>>> 10-15 MB/s per sending server or cpu core, whichever is less (in a vnode
>>> setup, you'll be cpu bound - in a single-token setup, you'll be stream
>>> bound).
>>>
>>>
>>>
>> --
> 
> Ben Slater
> Chief Product Officer
> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
> +61 437 929 798
>


Re: Streaming Process: How can we speed it up?

2016-09-15 Thread Vasileios Vlachos
Hello and thanks for your responses,

OK, so increasing stream_throughput_outbound_megabits_per_sec makes no
difference. Any ideas why streaming is limited to only two of the three
nodes available?
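
(For reference, this is roughly how we have been changing it, in case we are
doing something wrong; the 400 Mb/s figure is just an illustrative value:)

# in cassandra.yaml, picked up on restart
stream_throughput_outbound_megabits_per_sec: 400

# or at runtime on each node, without a restart (reverts to the yaml value on restart)
nodetool setstreamthroughput 400

# setting it to 0 should disable the throttle entirely, if I remember correctly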

As an alternative to slow streaming I tried this:

  - install C* on a new node, stop the service and delete
/var/lib/cassandra/*
 - rsync /etc/cassandra from old node to new node
 - rsync /var/lib/cassandra from old node to new node
 - stop C* on the old node
 - rsync /var/lib/cassandra from old node to new node
 - move the old node to a different IP
 - move the new node to the old node's original IP
 - start C* on the new node (no need for the replace_node option in
cassandra-env.sh)
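
A rough shell sketch of the steps above, in case it helps anyone; the host
names, paths and service name are assumptions for illustration only:

# on the new node: install C*, stop it, wipe the data directories
new$ sudo service cassandra stop
new$ sudo rm -rf /var/lib/cassandra/*

# initial copies while the old node is still serving traffic
old$ rsync -avH /etc/cassandra/ new-node:/etc/cassandra/
old$ rsync -avH /var/lib/cassandra/ new-node:/var/lib/cassandra/

# final delta copy with the old node stopped
old$ sudo service cassandra stop
old$ rsync -avH --delete /var/lib/cassandra/ new-node:/var/lib/cassandra/

# swap the IPs (old node to a spare address, new node to the old address),
# then start Cassandra on the new node only
new$ sudo service cassandra start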

This technique has been successful so far for a demo cluster with less
data. The only disadvantage for us is that we were hoping that by streaming
the SSTables to the new node, tombstones would be discarded (freeing a lot
of disk space on our live cluster). This is exactly what happened for the
one node we streamed so far; unfortunately, the slow streaming generates a
lot of hints which makes recovery a very long process.

Do you guys see any other problems with the rsync method that I've skipped?

Regarding the tombstones issue (if we finally do what I described above),
I'm thinking sstablesplit. Then compaction should deal with it (I think). I
have not used sstablesplit in the past, so another thing I'd like to ask is
if you guys find this a good/bad idea for what I'm trying to do.
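
In case it helps the discussion, my (untested) understanding is that we would
run it roughly like this, with Cassandra stopped on the node; the 50 MB target
size and the paths are just examples:

sstablesplit --size 50 /var/lib/cassandra/data/my_ks/my_cf/my_ks-my_cf-*-Data.db

and then let normal compaction take over once the node is started again.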

Many thanks,
Vasilis

On Mon, Sep 12, 2016 at 6:42 PM, Jeff Jirsa  wrote:

>
>
> On 2016-09-12 09:38 (-0700), daemeon reiydelle  wrote:
> > Re. throughput. That looks slow for jumbo with 10g. Check your networks.
> >
> >
>
> It's extremely unlikely you'll be able to saturate a 10g link with a
> single instance cassandra.
>
> Faster Cassandra streaming is a work in progress - being able to send more
> than one file at a time is probably the most obvious area for improvement,
> and being able to better deal with the CPU / garbage generated on the
> receiving side is just behind that. You'll likely be able to stream 10-15
> MB/s per sending server or cpu core, whichever is less (in a vnode setup,
> you'll be cpu bound - in a single-token setup, you'll be stream bound).
>
>
>


Re: Flush activity and dropped messages

2016-08-26 Thread Vasileios Vlachos
Hi Benedict,

This makes sense now. Thank you very much for your input.

Regards,
Vasilis

On 25 Aug 2016 10:30 am, "Benedict Elliott Smith" <bened...@apache.org>
wrote:

> You should update from 2.0 to avoid this behaviour, is the simple answer.
> You are correct that when the commit log gets full the memtables are
> flushed to make room.  2.0 has several interrelated problems here though:
>
> There is a maximum flush queue length property (I cannot recall its name),
> and once there are this many memtables flushing, no more writes can take
> place on the box, whatsoever.  You cannot simply increase this length,
> though, because that shrinks the maximum size of any single memtable (it
> is, iirc, total_memtable_space / (1 + flush_writers + max_queue_length)),
> which worsens write-amplification from compaction.
>
> This is because the memory management for memtables in 2.0 was really
> terrible, and this queue length was used to try to ensure the space
> allocated was not exceeded.
>
> Compounding this, when clearing the commit log 2.0 will flush all
> memtables with data in them regardless of it is useful to do so, meaning
> having more tables (that are actively written to) than your max queue
> length will necessarily cause stalls every time you run out of commit log
> space.
>
> In 2.1, none of these concerns apply.
>
>
> On 24 August 2016 at 23:40, Vasileios Vlachos <vasileiosvlac...@gmail.com>
> wrote:
>
>> Hello,
>>
>>
>>
>>
>>
>> We have an 8-node cluster spread out in 2 DCs, 4 nodes in each one. We
>> run C* 2.0.17 on Ubuntu 12.04 at the moment.
>>
>>
>>
>>
>> Our C# application often logs errors, which correlate with dropped
>> messages (usually counter mutations) in our Cassandra logs. We think that if
>> a specific mutation stays in the queue for more than 5 seconds, Cassandra
>> drops it. This is also suggested by these lines in system.log:
>>
>> ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
>> (line 912) MUTATION messages were dropped in last 5000 ms: 317 for internal
>> timeout and 0 for cross node timeout
>> ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
>> (line 912) COUNTER_MUTATION messages were dropped in last 5000 ms: 6 for
>> internal timeout and 0 for cross node timeout
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 55) Pool NameActive   Pending  Completed   Blocked
>>  All Time Blocked
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 70) ReadStage 0 0  245177190 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 70) RequestResponseStage  0 0 3530334509 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) ReadRepairStage   0 01549567 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) MutationStage48  1380 2540965500 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) ReplicateOnWriteStage 0 0  189615571 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) GossipStage   0 0   20586077 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) CacheCleanupExecutor  0 0  0 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) MigrationStage0 0106 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) MemoryMeter   0 0 303029 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
>> 70) ValidationExecutor0 0  0 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
>> 70) FlushWriter   1 5 322604 1
>>  8227
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
>> 70) InternalResponseStage 0 0 35 

Re: Flush activity and dropped messages

2016-08-26 Thread Vasileios Vlachos
Hi Patrick and thanks for your reply,

We are monitoring disk usage and more and we don't seem to be running out
of space at the moment. We have separate partitions/disks for
commitlog/data.  Which one do you suspect and why?

Regards,
Vasilis

On 25 Aug 2016 4:01 pm, "Patrick McFadin" <pmcfa...@gmail.com> wrote:

This looks like you've run out of disk. What are your hardware specs?

Patrick


On Thursday, August 25, 2016, Benedict Elliott Smith <bened...@apache.org>
wrote:

> You should update from 2.0 to avoid this behaviour, is the simple answer.
> You are correct that when the commit log gets full the memtables are
> flushed to make room.  2.0 has several interrelated problems here though:
>
> There is a maximum flush queue length property (I cannot recall its name),
> and once there are this many memtables flushing, no more writes can take
> place on the box, whatsoever.  You cannot simply increase this length,
> though, because that shrinks the maximum size of any single memtable (it
> is, iirc, total_memtable_space / (1 + flush_writers + max_queue_length)),
> which worsens write-amplification from compaction.
>
> This is because the memory management for memtables in 2.0 was really
> terrible, and this queue length was used to try to ensure the space
> allocated was not exceeded.
>
> Compounding this, when clearing the commit log 2.0 will flush all
> memtables with data in them regardless of it is useful to do so, meaning
> having more tables (that are actively written to) than your max queue
> length will necessarily cause stalls every time you run out of commit log
> space.
>
> In 2.1, none of these concerns apply.
>
>
> On 24 August 2016 at 23:40, Vasileios Vlachos <vasileiosvlac...@gmail.com>
> wrote:
>
>> Hello,
>>
>>
>>
>>
>>
>> We have an 8-node cluster spread out in 2 DCs, 4 nodes in each one. We
>> run C* 2.0.17 on Ubuntu 12.04 at the moment.
>>
>>
>>
>>
>> Our C# application often logs errors, which correlate with dropped
>> messages (usually counter mutations) in our Cassandra logs. We think that if
>> a specific mutation stays in the queue for more than 5 seconds, Cassandra
>> drops it. This is also suggested by these lines in system.log:
>>
>> ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
>> (line 912) MUTATION messages were dropped in last 5000 ms: 317 for internal
>> timeout and 0 for cross node timeout
>> ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
>> (line 912) COUNTER_MUTATION messages were dropped in last 5000 ms: 6 for
>> internal timeout and 0 for cross node timeout
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 55) Pool NameActive   Pending  Completed   Blocked
>>  All Time Blocked
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 70) ReadStage 0 0  245177190 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
>> 70) RequestResponseStage  0 0 3530334509 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) ReadRepairStage   0 01549567 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) MutationStage48  1380 2540965500 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
>> 70) ReplicateOnWriteStage 0 0  189615571 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) GossipStage   0 0   20586077 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) CacheCleanupExecutor  0 0  0 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) MigrationStage0 0106 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
>> 70) MemoryMeter   0 0 303029 0
>> 0
>>  INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
>> 70) ValidationExecutor0 0  0 0
>> 0
>>  INFO [ScheduledTasks:1]

Flush activity and dropped messages

2016-08-24 Thread Vasileios Vlachos
Hello,





We have an 8-node cluster spread out in 2 DCs, 4 nodes in each one. We run
C* 2.0.17 on Ubuntu 12.04 at the moment.




Our C# application often logs errors, which correlate with dropped messages
(usually counter mutations) in our Cassandra logs. We think that if a specific
mutation stays in the queue for more than 5 seconds, Cassandra drops it.
This is also suggested by these lines in system.log:

ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
(line 912) MUTATION messages were dropped in last 5000 ms: 317 for internal
timeout and 0 for cross node timeout
ERROR [ScheduledTasks:1] 2016-08-23 13:29:51,454 MessagingService.java
(line 912) COUNTER_MUTATION messages were dropped in last 5000 ms: 6 for
internal timeout and 0 for cross node timeout
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
55) Pool NameActive   Pending  Completed   Blocked
 All Time Blocked
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
70) ReadStage 0 0  245177190 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,455 StatusLogger.java (line
70) RequestResponseStage  0 0 3530334509 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
70) ReadRepairStage   0 01549567 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
70) MutationStage48  1380 2540965500 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,456 StatusLogger.java (line
70) ReplicateOnWriteStage 0 0  189615571 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
70) GossipStage   0 0   20586077 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
70) CacheCleanupExecutor  0 0  0 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
70) MigrationStage0 0106 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,457 StatusLogger.java (line
70) MemoryMeter   0 0 303029 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
70) ValidationExecutor0 0  0 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
70) FlushWriter   1 5 322604 1
 8227
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,458 StatusLogger.java (line
70) InternalResponseStage 0 0 35 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,459 StatusLogger.java (line
70) AntiEntropyStage  0 0  0 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,459 StatusLogger.java (line
70) MemtablePostFlusher   1 5 424104 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,459 StatusLogger.java (line
70) MiscStage 0 0  0 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,460 StatusLogger.java (line
70) PendingRangeCalculator0 0 37 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,460 StatusLogger.java (line
70) commitlog_archiver0 0  0 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,461 StatusLogger.java (line
70) CompactionExecutor4 45144499 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,461 StatusLogger.java (line
70) HintedHandoff 0 0   3194 0
0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,461 StatusLogger.java (line
79) CompactionManager 1 4
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,461 StatusLogger.java (line
81) Commitlog   n/a 0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,461 StatusLogger.java (line
93) MessagingServicen/a   0/0
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,462 StatusLogger.java (line
103) Cache Type Size Capacity
KeysToSave
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,462 StatusLogger.java (line
105) KeyCache  104828280104857600
   all
 INFO [ScheduledTasks:1] 2016-08-23 13:29:51,462 StatusLogger.java (line
111) RowCache  00
   all



So far we have noticed that when our application logs errors, 

Re: StatusLogger output

2016-03-31 Thread Vasileios Vlachos
Anyone else any idea on how to interpret StatusLogger output? As Sean said,
this may not help in determining the problem, but it would definitely help
my general understanding.

Thanks,
Bill

On Thu, Mar 24, 2016 at 5:24 PM, Vasileios Vlachos <
vasileiosvlac...@gmail.com> wrote:

> Thanks for your help Sean,
>
> The reason StatusLogger messages appear in the logs is usually, as you
> said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
> our case dropped messages are always (so far) due to internal timeouts, not
> due to cross node timeouts (like the sample output in the link I provided
> earlier). I have seen StatusLogger output during low traffic times and I
> cannot say that we seem to have more logs during high-traffic hours.
>
> We use Nagios for monitoring and have several checks for cassandra (we use
> the JMX console for each node). However, most graphs are averaged out. I
> can see some spikes at the times, however, these spikes only go around
> 20-30% of the load we get during high-traffic times. The only time we have
> seen nodes marked down in the logs is when there is some severe cross-DC
> VPN issue, which is not something that happens often and does not correlate
> with StatusLogger output either.
>
> Regarding GC, we only see up to 10 GC pauses per day in the logs (I ofc
> mean over 200ms which is the threshold for logging GC events by default).
> We are actually experimenting with GC these days on one of the nodes, but I
> cannot say this has made things worse/better.
>
> I was hoping that by understanding the StatusLogger output better I'd be
> able to investigate further. We monitor metrics like hints, pending
> tasks, reads/writes per CF, read/write latency/CF, compactions,
> connections/node. If there is anything from the JMX console that you would
> suggest I should be monitoring, please let me know. I was thinking
> compactions may be the reason for this (so, I/O could be the bottleneck),
> but looking at the graphs I can see that when a node compacts its CPU usage
> would only max at around 20-30% and would only add 2-5ms of read/write
> latency per CF (if any).
>
> Thanks,
> Vasilis
>
> On Thu, Mar 24, 2016 at 3:31 PM, <sean_r_dur...@homedepot.com> wrote:
>
>> I am not sure the status logger output helps determine the problem.
>> However, the dropped mutations and the status logger output is what I see
>> when there is too high of a load on one or more Cassandra nodes. It could
>> be long GC pauses, something reading too much data (a large row or a
>> multi-partition query), or just too many requests for the number of nodes
>> you have. Are you using OpsCenter to monitor the rings? Do you have read or
>> write spikes at the time? Any GC messages in the log. Any nodes going down
>> at the time?
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
>> *Sent:* Thursday, March 24, 2016 8:13 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: StatusLogger output
>>
>>
>>
>> Just to clarify, I can see line 29 which seems to explain the format
>> (first number ops, second is data), however I don't know what they actually
>> mean.
>>
>>
>>
>> Thanks,
>>
>> Vasilis
>>
>>
>>
>> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
>> vasileiosvlac...@gmail.com> wrote:
>>
>> Hello,
>>
>>
>>
>> Environment:
>>
>> - Cassandra 2.0.17, 8 nodes, 4 per DC
>>
>> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>>
>>
>>
>> Every node seems to be dropping messages (anywhere from 10 to 300) twice
>> a day. I don't know if this has always been the case, but it has definitely
>> been going on for the past month or so. Whenever that happens we get
>> StatusLogger.java output in the log, which is the state of the node at
>> the time it dropped messages. This output contains information
>> similar/identical to nodetool tpstats, but further from that,
>> information regarding system CF follows as can be seen here:
>> http://ur1.ca/ooan6
>>
>>
>>
>> How can we use this information to find out what the problem was? I am
>> specifically referring to the information regarding the system CF. I had a
>> look in the system tables but I cannot draw anything from that. The output
>> in the log seems to contain two values (comma separated). What are these
>> numbers?
>>
>>
>>
>> I wasn't able to find anything on the web/DataStax docs. Any help would
>> be greatly appreciated!
>>
>>
>>

Re: StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Thanks for your help Sean,

The reason StatusLogger messages appear in the logs is usually, as you
said, a GC pause (ParNew or CMS, I have seen both), or dropped messages. In
our case dropped messages are always (so far) due to internal timeouts, not
due to cross node timeouts (like the sample output in the link I provided
earlier). I have seen StatusLogger output during low traffic times and I
cannot say that we seem to have more logs during high-traffic hours.

We use Nagios for monitoring and have several checks for cassandra (we use
the JMX console for each node). However, most graphs are averaged out. I
can see some spikes at those times; however, these spikes only reach around
20-30% of the load we get during high-traffic times. The only time we have
seen nodes marked down in the logs is when there is some severe cross-DC
VPN issue, which is not something that happens often and does not correlate
with StatusLogger output either.

Regarding GC, we only see up to 10 GC pauses per day in the logs (I ofc
mean over 200ms which is the threshold for logging GC events by default).
We are actually experimenting with GC these days on one of the nodes, but I
cannot say this has made things worse/better.

I was hoping that by understanding the StatusLogger output better I'd be
able to investigate further. We monitor metrics like hints, pending tasks,
reads/writes per CF, read/write latency/CF, compactions, connections/node.
If there is anything from the JMX console that you would suggest I should
be monitoring, please let me know. I was thinking compactions may be the
reason for this (so, I/O could be the bottleneck), but looking at the
graphs I can see that when a node compacts its CPU usage would only max at
around 20-30% and would only add 2-5ms of read/write latency per CF (if
any).

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 3:31 PM, <sean_r_dur...@homedepot.com> wrote:

> I am not sure the status logger output helps determine the problem.
> However, the dropped mutations and the status logger output is what I see
> when there is too high of a load on one or more Cassandra nodes. It could
> be long GC pauses, something reading too much data (a large row or a
> multi-partition query), or just too many requests for the number of nodes
> you have. Are you using OpsCenter to monitor the rings? Do you have read or
> write spikes at the time? Any GC messages in the log. Any nodes going down
> at the time?
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
> *Sent:* Thursday, March 24, 2016 8:13 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: StatusLogger output
>
>
>
> Just to clarify, I can see line 29 which seems to explain the format
> (first number ops, second is data), however I don't know what they actually
> mean.
>
>
>
> Thanks,
>
> Vasilis
>
>
>
> On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
> vasileiosvlac...@gmail.com> wrote:
>
> Hello,
>
>
>
> Environment:
>
> - Cassandra 2.0.17, 8 nodes, 4 per DC
>
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
>
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know if this has always been the case, but it has definitely
> been going on for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
>
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
>
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
>
>
> Thanks,
>
> Vasilis
>
>
>
> --
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability

Re: StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Just to clarify, I can see line 29 which seems to explain the format (first
number ops, second is data), however I don't know what they actually mean.

Thanks,
Vasilis

On Thu, Mar 24, 2016 at 11:45 AM, Vasileios Vlachos <
vasileiosvlac...@gmail.com> wrote:

> Hello,
>
> Environment:
> - Cassandra 2.0.17, 8 nodes, 4 per DC
> - Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)
>
> Every node seems to be dropping messages (anywhere from 10 to 300) twice a
> day. I don't know if this has always been the case, but it has definitely
> been going on for the past month or so. Whenever that happens we get
> StatusLogger.java output in the log, which is the state of the node at
> the time it dropped messages. This output contains information
> similar/identical to nodetool tpstats, but further from that, information
> regarding system CF follows as can be seen here: http://ur1.ca/ooan6
>
> How can we use this information to find out what the problem was? I am
> specifically referring to the information regarding the system CF. I had a
> look in the system tables but I cannot draw anything from that. The output
> in the log seems to contain two values (comma separated). What are these
> numbers?
>
> I wasn't able to find anything on the web/DataStax docs. Any help would be
> greatly appreciated!
>
> Thanks,
> Vasilis
>


StatusLogger output

2016-03-24 Thread Vasileios Vlachos
Hello,

Environment:
- Cassandra 2.0.17, 8 nodes, 4 per DC
- Ubuntu 12.04, 6-Cores, 16GB of RAM (we use VMWare)

Every node seems to be dropping messages (anywhere from 10 to 300) twice a
day. I don't know if this has always been the case, but it has definitely been
going on for the past month or so. Whenever that happens we get
StatusLogger.java output in the log, which is the state of the node at the
time it dropped messages. This output contains information
similar/identical to nodetool tpstats, but further from that, information
regarding system CF follows as can be seen here: http://ur1.ca/ooan6

How can we use this information to find out what the problem was? I am
specifically referring to the information regarding the system CF. I had a
look in the system tables but I cannot draw anything from that. The output
in the log seems to contain two values (comma separated). What are these
numbers?

I wasn't able to find anything on the web/DataStax docs. Any help would be
greatly appreciated!

Thanks,
Vasilis


Re: How do I upgrade from 2.0.16 to 2.0.17 in my case????

2016-01-11 Thread Vasileios Vlachos
Thanks Michael,

I'll try that then. I need to figure out how to do it with Ubuntu's upstart
because I've not done it before.
On 7 Jan 2016 4:25 pm, "Michael Shuler" <mich...@pbandjelly.org> wrote:

> On 01/07/2016 07:52 AM, Vasileios Vlachos wrote:
> > Hello,
> >
> > My problem is described in CASSANDRA-10872
> > <https://issues.apache.org/jira/browse/CASSANDRA-10872>. I upgraded a
> > second node on the same cluster in case there was something special with
> > the first node but I experienced identical behaviour. Both
> > cassandra-env.sh and cassandra-rackdc.properties were replaced
> > causing the node to come up in the default data centre DC1.
> >
> > What is the best way to upgrade to 2.0.17 in a safe manner in this case?
> > How do we work around this?
>
> I've made a bit of headway on this, but don't have this automated in CI
> fully, yet. In quick tests, I get prompted on upgrade when my config
> files have changed from the originals, similar to your later comment on
> that JIRA. This replacement without prompt could be a system
> configuration to not prompt you(?). I'm not sure how one would change
> that behavior system-wide, since I've never turned this knob, but I'd
> suggest looking at debconf options.
>
> I'm in favor of CASSANDRA-2356, and with the beginning of tick-tock
> releases, this is a good time to get this in as a new feature. As for
> configuring your existing system to not restart services on upgrade, see
> https://people.debian.org/~hmh/invokerc.d-policyrc.d-specification.txt
> for setting up a local policy to behave as you wish.
>
> --
> Michael
>


How do I upgrade from 2.0.16 to 2.0.17 in my case????

2016-01-07 Thread Vasileios Vlachos
Hello,

My problem is described in CASSANDRA-10872
<https://issues.apache.org/jira/browse/CASSANDRA-10872>. I upgraded a
second node on the same cluster in case there was something special with
the first node but I experienced identical behaviour. Both cassandra-env.sh
and cassandra-rackdc.properties were replaced
causing the node to come up in the default data centre DC1.

What is the best way to upgrade to 2.0.17 in a safe manner in this case?
How do we work around this?

Thanks,
Vasilis


Re: Thousands of pending compactions using STCS

2015-12-11 Thread Vasileios Vlachos
Anuj, Jeff, thank you both,

Although harmless, sounds like it's time for an upgrade. The ticket
suggests that 2.0.17 is not affected.

Thank you guys!

On Fri, Dec 11, 2015 at 5:25 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
wrote:

> Same bug also affects 2.0.16 -
> https://issues.apache.org/jira/browse/CASSANDRA-9662
>
> From: Jeff Jirsa
> Reply-To: <user@cassandra.apache.org>
> Date: Friday, December 11, 2015 at 9:12 AM
> To: "user@cassandra.apache.org"
> Subject: Re: Thousands of pending compactions using STCS
>
> There were a few buggy versions in 2.1 (2.1.7, 2.1.8, I believe) that
> showed this behavior. The number of pending compactions was artificially
> high, and not meaningful. As long as they number of –Data.db sstables
> remains normal, compaction is keeping up and you’re fine.
>
> - Jeff
>
> From: Vasileios Vlachos
> Reply-To: "user@cassandra.apache.org"
> Date: Friday, December 11, 2015 at 8:28 AM
> To: "user@cassandra.apache.org"
> Subject: Thousands of pending compactions using STCS
>
> Hello,
>
> We use Nagios and MX4J for the majority of the monitoring we do for
> Cassandra (version: 2.0.16). For compactions we hit the following URL:
>
>
> http://${cassandra_host}:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DCompactionManager
>
> and check the PendingTasks counter's value.
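
(For anyone who wants to reproduce the check, it is essentially the following;
the grep pattern depends on MX4J's HTML output, so treat it as a rough sketch:)

#!/bin/bash
# fetch the CompactionManager MBean page from MX4J and pick out PendingTasks
HOST=${1:-localhost}
curl -s "http://${HOST}:8081/mbean?objectname=org.apache.cassandra.db%3Atype%3DCompactionManager" \
  | grep -i -A 2 'PendingTasks'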
>
> We have noticed that occasionally one or more nodes will report back that
> they have thousands of pending compactions. We have 11 KS in the cluster
> and a total of 109 *Data.db files under /var/lib/cassandra/data which
> gives approximately 10 SSTables per KS. That makes us think that having
> thousands of pending compactions seems unrealistic given the number of
> SSTables we seem to have at any given time in each KS/CF directory. The
> logs show a lot of flush and compaction activity but we don't think that's
> unusual. Also, each CF is configured to have min_compaction_threshold = 2
> and max_compaction_threshold = 32. The two screenshots below show a
> cluster-wide view of pending compactions. Attached you can find the XML
> files which contain the data from the MX4J console.
>
> [image: Inline image 2]
>
> And this is from the same graph, but I've selected the time period after
> 14:00 in order to show what the real compaction activity looks like when
> not skewed by the incredibly high number of pending compactions as shown
> above:
> [image: Inline image 3]
>
> Has anyone else experienced something similar? Is there something else we
> can do to see if this is something wrong with our cluster?
>
> Thanks in advance for any help!
>
> Vasilis
>


Re: Upgrade instructions don't make sense

2015-11-23 Thread Vasileios Vlachos
If you want to go from 2.0 to 2.1 and you are NOT using vnodes on your
current cluster (that is version 2.0), then make sure you disable them on
the new 2.1 config during the upgrade. Otherwise just leave the setting as
is.

That's how I understand it personally.  We are going to upgrade at some
point so I'd appreciate if people could confirm that.
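
For the avoidance of doubt, this is the sort of cassandra.yaml sketch I have in
mind for the two cases (the values are illustrative and based on the DataStax
page quoted below):

# Case A: the cluster already uses vnodes -- keep the 2.1 defaults
num_tokens: 256
# initial_token:      (left commented out)

# Case B: the cluster does NOT use vnodes -- disable them in the new 2.1 config
num_tokens: 1
initial_token: <this node's token>   # 1, or a generated token, per the docs quoted below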
On 23 Nov 2015 11:23 pm, "Sebastian Estevez" 
wrote:

> If your cluster does not use vnodes, disable vnodes in each new
>> cassandra.yaml
>
>
> If your cluster *does* use vnodes do *not* disable them.
>
> All the best,
>
>
> [image: datastax_logo.png] 
>
> Sebastián Estévez
>
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
>
> DataStax is the fastest, most scalable distributed database technology,
> delivering Apache Cassandra to the world’s most innovative enterprises.
> Datastax is built to be agile, always-on, and predictably scalable to any
> size. With more than 500 customers in 45 countries, DataStax is the
> database technology and transactional backbone of choice for the worlds
> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>
> On Mon, Nov 23, 2015 at 5:55 PM, Robert Wille  wrote:
>
>> I’m wanting to upgrade from 2.0 to 2.1. The upgrade instructions at
>> http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html
>>  has
>> the following, which leaves me with more questions than it answers:
>>
>> If your cluster does not use vnodes, disable vnodes in each new
>> cassandra.yaml before doing the rolling restart.
>> In Cassandra 2.0.x, virtual nodes (vnodes) are enabled by default.
>> Disable vnodes in the 2.0.x version before upgrading.
>>
>>    1. In the cassandra.yaml file, set num_tokens to 1.
>>    2. Uncomment the initial_token property and set it to 1 or to the
>>    value of a generated token for a multi-node cluster.
>>
>>
>> It seems strange that vnodes have to be disabled to upgrade, but whatever.
>> If I use an initial token generator to set the initial_token property of
>> each node, then I assume that my token ranges are all going to change, and
>> that there’s going to be a whole bunch of streaming as the data is shuffled
>> around. The docs don’t mention that. Should I wait until the streaming is
>> done before proceeding with the upgrade?
>>
>> The docs don’t talk about vnodes and initial_tokens post-upgrade. Can I
>> turn vnodes back on? Am I forever after stuck with having to have manually
>> generated initial tokens (and needing to have a unique cassandra.yaml for
>> every node)? Can I just set num_tokens = 256 and comment out initial_token
>> and do a rolling restart?
>>
>> Thanks in advance
>>
>> Robert
>>
>>
>


Re: Upgrade instructions don't make sense

2015-11-23 Thread Vasileios Vlachos
Exactly, thanks!
On 23 Nov 2015 11:26 pm, "Vasileios Vlachos" <vasileiosvlac...@gmail.com>
wrote:

> If you want to go from 2.0 to 2.1 and you are NOT using vnodes on your
> current cluster (that is version 2.0), then make sure you disable them on
> the new 2.1 config during the upgrade. Otherwise just leave the setting as
> is.
>
> That's how I understand it personally.  We are going to upgrade at some
> point so I'd appreciate if people could confirm that.
> On 23 Nov 2015 11:23 pm, "Sebastian Estevez" <
> sebastian.este...@datastax.com> wrote:
>
>> If your cluster does not use vnodes, disable vnodes in each new
>>> cassandra.yaml
>>
>>
>> If your cluster *does* use vnodes do *not* disable them.
>>
>> All the best,
>>
>>
>> [image: datastax_logo.png] <http://www.datastax.com/>
>>
>> Sebastián Estévez
>>
>> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>>
>>
>> DataStax is the fastest, most scalable distributed database technology,
>> delivering Apache Cassandra to the world’s most innovative enterprises.
>> Datastax is built to be agile, always-on, and predictably scalable to any
>> size. With more than 500 customers in 45 countries, DataStax is the
>> database technology and transactional backbone of choice for the worlds
>> most innovative companies such as Netflix, Adobe, Intuit, and eBay.
>>
>> On Mon, Nov 23, 2015 at 5:55 PM, Robert Wille <rwi...@fold3.com> wrote:
>>
>>> I’m wanting to upgrade from 2.0 to 2.1. The upgrade instructions at
>>> http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html
>>>  has
>>> the following, which leaves me with more questions than it answers:
>>>
>>> If your cluster does not use vnodes, disable vnodes in each new
>>> cassandra.yaml before doing the rolling restart.
>>> In Cassandra 2.0.x, virtual nodes (vnodes) are enabled by default.
>>> Disable vnodes in the 2.0.x version before upgrading.
>>>
>>>    1. In the cassandra.yaml file
>>>    <http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html#upgradeCassandraDetails__cassandrayaml_unique_7>,
>>>    set num_tokens to 1.
>>>    2. Uncomment the initial_token property and set it to 1 or to the
>>>    value of a generated token
>>>    <http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configGenTokens_c.html>
>>>    for a multi-node cluster.
>>>
>>>
>>> It seems strange that vnodes have to be disabled to upgrade, but
>>> whatever. If I use an initial token generator to set the initial_token
>>> property of each node, then I assume that my token ranges are all going to
>>> change, and that there’s going to be a whole bunch of streaming as the data
>>> is shuffled around. The docs don’t mention that. Should I wait until the
>>> streaming is done before proceeding with the upgrade?
>>>
>>> The docs don’t talk about vnodes and initial_tokens post-upgrade. Can I
>>> turn vnodes back on? Am I forever after stuck with having to have manually
>>> generated initial tokens (and needing to have a unique cassandra.yaml for
>>> every node)? Can I just set num_tokens = 256 and comment out initial_token
>>> and do a rolling restart?
>>>
>>> Thanks in advance
>>>
>>> Robert
>>>
>>>
>>


Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?

2015-10-27 Thread Vasileios Vlachos
Rob,

Would you mind elaborating further on this? I am a little concerned that
my understanding (nodetool repair is *not* the only way one can achieve
consistency) is not correct. I understand that if people use CL < QUORUM,
nodetool repair is the only way to go, but I just cannot see how can that
be the only way irrespective of everything else.

Thanks in advance for your input!

On Sat, Oct 24, 2015 at 10:02 PM, Vasileios Vlachos <
vasileiosvlac...@gmail.com> wrote:

>
>> All other means of repair are optimizations which require a certain
>> amount of luck to happen to result in consistency.
>>
>
> Is that true regardless of the CL one uses? So, for example if writing
> QUORUM and reading QUORUM, wouldn't an increased read_repair_chance
> probability be sufficient? If not, is there a case where nodetool repair
> wouldn't be required (given consistency is a requirement)?
>
> Thanks
>


Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?

2015-10-24 Thread Vasileios Vlachos
I am not sure I fully understand the question, because nodetool repair is
one of the three ways for Cassandra to ensure consistency. If by "affect"
you mean "make your data consistent and ensure all replicas are
up-to-date", then yes, that's what I think it does.

And yes, I would expect nodetool repair (especially depending on the
options appended to it) to have a performance impact, but how big that
impact is going to be depends on many things.
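
For reference, the kind of options I had in mind are along these lines (a
sketch; exact flags vary between versions, and the keyspace/table names are
just examples):

nodetool repair my_ks               # sequential repair of one keyspace
nodetool repair -pr my_ks           # only this node's primary ranges; run it on every node
nodetool repair my_ks my_cf         # restrict the repair to a single table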

We currently perform no scheduled repairs because of our workload and the
consistency level that we use. So, as you can understand I am certainly not
the best person to analyse that bit...

Regards,
Vasilis

On Sat, Oct 24, 2015 at 5:09 PM, Ajay Garg <ajaygargn...@gmail.com> wrote:

> Thanks a ton Vasileios !!
>
> Just one last question ::
> Does running "nodetool repair" affect the functionality of cluster for
> current-live data?
>
> It's ok if the insertions/deletions of current-live data become a little
> slow during the process, but data-consistency must be maintained. If that
> is the case, I think we are good.
>
>
> Thanks and Regards,
> Ajay
>
> On Sat, Oct 24, 2015 at 6:03 PM, Vasileios Vlachos <
> vasileiosvlac...@gmail.com> wrote:
>
>> Hello Ajay,
>>
>> Here is a good link:
>>
>> http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesManualRepair.html
>>
>> Generally, I find the DataStax docs to be OK. You could consult them for
>> all usual operations etc. Ofc there are occasions where a given concept is
>> not as clear, but you can always ask this list for clarification.
>>
>> If you find that something is wrong in the docs just email them (more
>> info and contact email here: http://docs.datastax.com/en/ ).
>>
>> Regards,
>> Vasilis
>>
>> On Sat, Oct 24, 2015 at 1:04 PM, Ajay Garg <ajaygargn...@gmail.com>
>> wrote:
>>
>>> Thanks Vasileios for the reply !!!
>>> That makes sense !!!
>>>
>>> I will be grateful if you could point me to the node-repair command for
>>> Cassandra-2.1.10.
>>> I don't want to get stuck in a wrong-versioned documentation (already
>>> bitten once hard when setting up replication).
>>>
>>> Thanks again...
>>>
>>>
>>> Thanks and Regards,
>>> Ajay
>>>
>>> On Sat, Oct 24, 2015 at 4:14 PM, Vasileios Vlachos <
>>> vasileiosvlac...@gmail.com> wrote:
>>>
>>>> Hello Ajay,
>>>>
>>>> Have a look in the *max_hint_window_in_ms* :
>>>>
>>>>
>>>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html
>>>>
>>>> My understanding is that if a node remains down for more than
>>>> *max_hint_window_in_ms*, then you will need to repair that node.
>>>>
>>>> Thanks,
>>>> Vasilis
>>>>
>>>> On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg <ajaygargn...@gmail.com>
>>>> wrote:
>>>>
>>>>> If a node in the cluster goes down and comes up, the data gets synced
>>>>> up on this downed node.
>>>>> Is there a limit on the interval for which the node can remain down?
>>>>> Or the data will be synced up even if the node remains down for
>>>>> weeks/months/years?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Ajay
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ajay
>>>
>>
>>
>
>
> --
> Regards,
> Ajay
>


Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?

2015-10-24 Thread Vasileios Vlachos
>
>
> All other means of repair are optimizations which require a certain amount
> of luck to happen to result in consistency.
>

Is that true regardless of the CL one uses? So, for example if writing
QUORUM and reading QUORUM, wouldn't an increased read_repair_chance
probability be sufficient? If not, is there a case where nodetool repair
wouldn't be required (given consistency is a requirement)?
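
To make the example concrete, the knob I have in mind is the per-table read
repair option; the table name and the 0.1 value are purely illustrative:

ALTER TABLE my_ks.my_cf WITH read_repair_chance = 0.1;
ALTER TABLE my_ks.my_cf WITH dclocal_read_repair_chance = 0.1;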

Thanks


Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?

2015-10-24 Thread Vasileios Vlachos
Hello Ajay,

Here is a good link:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesManualRepair.html

Generally, I find the DataStax docs to be OK. You could consult them for
all usual operations etc. Ofc there are occasions where a given concept is
not as clear, but you can always ask this list for clarification.

If you find that something is wrong in the docs just email them (more info
and contact email here: http://docs.datastax.com/en/ ).

Regards,
Vasilis

On Sat, Oct 24, 2015 at 1:04 PM, Ajay Garg <ajaygargn...@gmail.com> wrote:

> Thanks Vasileios for the reply !!!
> That makes sense !!!
>
> I will be grateful if you could point me to the node-repair command for
> Cassandra-2.1.10.
> I don't want to get stuck in a wrong-versioned documentation (already
> bitten once hard when setting up replication).
>
> Thanks again...
>
>
> Thanks and Regards,
> Ajay
>
> On Sat, Oct 24, 2015 at 4:14 PM, Vasileios Vlachos <
> vasileiosvlac...@gmail.com> wrote:
>
>> Hello Ajay,
>>
>> Have a look in the *max_hint_window_in_ms* :
>>
>>
>> http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html
>>
>> My understanding is that if a node remains down for more than
>> *max_hint_window_in_ms*, then you will need to repair that node.
>>
>> Thanks,
>> Vasilis
>>
>> On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg <ajaygargn...@gmail.com>
>> wrote:
>>
>>> If a node in the cluster goes down and comes up, the data gets synced up
>>> on this downed node.
>>> Is there a limit on the interval for which the node can remain down? Or
>>> the data will be synced up even if the node remains down for
>>> weeks/months/years?
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Ajay
>>>
>>
>>
>
>
> --
> Regards,
> Ajay
>


Re: Downtime-Limit for a node in Network-Topology-Replication-Cluster?

2015-10-24 Thread Vasileios Vlachos
Hello Ajay,

Have a look in the *max_hint_window_in_ms* :

http://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html

My understanding is that if a node remains down for more than
*max_hint_window_in_ms*, then you will need to repair that node.
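
For reference, the setting lives in cassandra.yaml and the shipped default is
three hours, if I remember correctly:

# how long a coordinator keeps hints for a node it sees as down;
# if the node stays down longer than this, it needs a repair instead
max_hint_window_in_ms: 10800000    # 3 hours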

Thanks,
Vasilis

On Sat, Oct 24, 2015 at 7:48 AM, Ajay Garg  wrote:

> If a node in the cluster goes down and comes up, the data gets synced up
> on this downed node.
> Is there a limit on the interval for which the node can remain down? Or
> the data will be synced up even if the node remains down for
> weeks/months/years?
>
>
>
> --
> Regards,
> Ajay
>


Re: Upgrade Limitations Question

2015-09-17 Thread Vasileios Vlachos
Thank you very much for pointing this out Victor. Really useful to know.

On Wed, Sep 16, 2015 at 4:55 PM, Victor Chen <victor.h.c...@gmail.com>
wrote:

> Yes, you can examine the actual sstables in your cassandra data dir. That
> will tell you what version sstables you have on that node.
>
> You can refer to this link:
> http://www.bajb.net/2013/03/cassandra-sstable-format-version-numbers/
> which I found via google search phrase "sstable versions" to see which
> version you need to look for-- the relevant section of the link says:
>
>> Cassandra stores the version of the SSTable within the filename,
>> following the format *Keyspace-ColumnFamily-(optional tmp
>> marker-)SSTableFormat-generation*
>>
>
> FYI-- and at least in the cassandra-2.1 branch of the source code-- you
> can find sstable format generation descriptions in comments of
> Descriptor.java. Looks like for your old and new versions, you'd be looking
> for something like:
>
> for 1.2.1:
> $ find <data_dir> -name "*-ib-*" -ls
>
> for 2.0.1:
> $ find <data_dir> -name "*-jb-*" -ls
>
>
> On Wed, Sep 16, 2015 at 10:02 AM, Vasileios Vlachos <
> vasileiosvlac...@gmail.com> wrote:
>
>>
>> Hello Rob and thanks for your reply,
>>
>> In the end we had to wait for upgradesstables to finish on every
>> node. Just to eliminate the possibility of this being the reason of any
>> weird behaviour after the upgrade. However, this process might take a long
>> time in a cluster with a large number of nodes which means no new work can
>> be done for that period.
>>
>> 1) TRUNCATE requires all known nodes to be available to succeed, if you
>>> are restarting one, it won't be available.
>>>
>>
>> I suppose all means all, not all replicas here, is that right? Not
>> directly related to the original question, but that might explain why we
>> end up with peculiar behaviour some times when we run TRUNCATE. We've now
>> taken the approach DROP it and do it again when possible (even though this
>> is still problematic when using the same CF name).
>>
>>
>>> 2) in theory, the newly upgraded nodes might not get the DDL schema
>>> update properly due to some incompatible change
>>>
>>> To check for 2, do :
>>> "
>>> nodetool gossipinfo | grep SCHEMA |sort | uniq -c | sort -n
>>> "
>>>
>>> Before and after and make sure the schema propagates correctly. There
>>> should be a new version on all nodes between each DDL change, if there is
>>> you will likely be able to see the new schema on all the new nodes.
>>>
>>>
>> Yes, this makes perfect sense. We monitor the schema changes every
>> minute across the cluster with Nagios by checking the JMX console. It is
>> an important thing to monitor in several situations (running migrations for
>> example, or during upgrades like you describe here).
>>
>> Is there a way to find out if the upgradesstables has been run against a
>> particular node or not?
>>
>> Many Thanks,
>> Vasilis
>>
>
>


Re: Upgrade Limitations Question

2015-09-16 Thread Vasileios Vlachos
Hello Rob and thanks for your reply,

In the end we had to wait for upgradesstables to finish on every node.
Just to eliminate the possibility of this being the reason of any weird
behaviour after the upgrade. However, this process might take a long time
in a cluster with a large number of nodes which means no new work can be
done for that period.
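
For anyone following along, the per-node step itself is simply the following
(the keyspace/table names are illustrative):

# run on each node once it is on the new binaries
nodetool upgradesstables                 # all keyspaces
nodetool upgradesstables my_ks my_cf     # or limit it to one table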

1) TRUNCATE requires all known nodes to be available to succeed, if you are
> restarting one, it won't be available.
>

I suppose all means all, not all replicas here, is that right? Not directly
related to the original question, but that might explain why we end up with
peculiar behaviour sometimes when we run TRUNCATE. We've now taken the
approach of DROPping the CF and recreating it when possible (even though this
is still problematic when using the same CF name).


> 2) in theory, the newly upgraded nodes might not get the DDL schema update
> properly due to some incompatible change
>
> To check for 2, do :
> "
> nodetool gossipinfo | grep SCHEMA |sort | uniq -c | sort -n
> "
>
> Before and after and make sure the schema propagates correctly. There
> should be a new version on all nodes between each DDL change, if there is
> you will likely be able to see the new schema on all the new nodes.
>
>
Yes, this makes perfect sense. We monitor the schema changes every minute
across the cluster with Nagios by checking the JMX console. It is an
important thing to monitor in several situations (running migrations for
example, or during upgrades like you describe here).

Is there a way to find out if the upgradesstables has been run against a
particular node or not?

Many Thanks,
Vasilis


Re: Upgrade Limitations Question

2015-09-13 Thread Vasileios Vlachos
Any thoughts anyone?
On 9 Sep 2015 20:09, "Vasileios Vlachos" <vasileiosvlac...@gmail.com> wrote:

> Hello All,
>
> I've asked this on the Cassandra IRC channel earlier, but I am asking the
> list as well so that I get feedback from more people.
>
> We have recently upgraded from Cassandra 1.2.19 to 2.0.16 and we are
> currently in the stage where all boxes are running 2.0.16 but nodetool
> upgradesstables has not yet been performed on all of them. Reading the
> DataStax docs [1] :
>
>- Do not issue these types of queries during a rolling restart: DDL,
>TRUNCATE
>
> In our case the restart bit has already been done. Do you know if it would
> be a bad idea to create a new KS before all nodes have upgraded their
> SSTables? Our concern is the time it takes to go through every single node,
> run the upgradesstables and wait until it's all done. We think creating a
> new KS wouldn't be a problem (someone on the channel said the same thing,
> but recommended that we play safe and wait until it's all done). But if
> anyone has any catastrophic experiences in doing so we would appreciate
> their input.
>
> Many thanks,
> Vasilis
>
> [1]
> http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html
>


Upgrade Limitations Question

2015-09-09 Thread Vasileios Vlachos
Hello All,

I've asked this on the Cassandra IRC channel earlier, but I am asking the
list as well so that I get feedback from more people.

We have recently upgraded from Cassandra 1.2.19 to 2.0.16 and we are
currently in the stage where all boxes are running 2.0.16 but nodetool
upgradesstables has not yet been performed on all of them. Reading the
DataStax docs [1] :

   - Do not issue these types of queries during a rolling restart: DDL,
   TRUNCATE

In our case the restart bit has already been done. Do you know if it would
be a bad idea to create a new KS before all nodes have upgraded their
SSTables? Our concern is the time it takes to go through every single node,
run the upgradesstables and wait until it's all done. We think creating a
new KS wouldn't be a problem (someone on the channel said the same thing,
but recommended that we play safe and wait until it's all done). But if
anyone has any catastrophic experiences in doing so we would appreciate
their input.

Many thanks,
Vasilis

[1]
http://docs.datastax.com/en/upgrade/doc/upgrade/cassandra/upgradeCassandraDetails.html


Re: Replacing dead node and cassandra.replace_address

2015-09-08 Thread Vasileios Vlachos
I think you should be able to see the streaming process by running nodetool
netstats. I also think system.log displays similar information about
streaming/when streaming is finished. Shouldn't the state of the node change
to UP when bootstrap is completed as well?
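
Something along these lines is what I mean (the host name is illustrative):

# streaming sessions still in progress on the joining node
nodetool -h new-node netstats

# the node should go from UJ (Up/Joining) to UN (Up/Normal) once bootstrap finishes
nodetool status | grep new-node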

People, correct me if I'm wrong here...
On 8 Sep 2015 20:56, "Maciek Sakrejda"  wrote:

> On Tue, Sep 8, 2015 at 11:14 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> Once the new node is bootstrapped, you could remove replacement_address
>> from the env.sh file
>>
> Thanks, but how do I know when bootstrapping is completed?
>


CL not satisfied when new node is joining?

2015-06-16 Thread Vasileios Vlachos
Hello,

We have a demo cassandra cluster (version 1.2.18) with two DCs and 4 nodes
in total. Our client is using the Datastax C# driver (version 1.2.7).
RF='DC1':2, 'DC2':2. The consistency level is set to LOCAL_QUORUM and all
traffic is coming directly from the application servers in DC1, which then
asynchronously replicates to DC2 (so the LOCAL DC from the application's
perspective is DC1). There are two nodes in each DC and even though that's
a demo cluster, we thought it would be nice to add another node in each DC
to be able to handle failures/maintenance downtime.

We started by adding a new node to DC2 as per instructions here:
http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html

Almost immediately after the cassandra process was started on this new
node, application logs were thrown which looked like so:

...
System.AggregateException: One or more errors occurred. --->
Cassandra.WriteTimeoutException: Cassandra timeout during write query at
consistency LOCAL_QUORUM (2 replica(s) acknowledged the write over 3
required)
...

and several other timeouts... During this process we were tailing the
system.log from all 5 cassandra nodes and there were no errors or warning
signs. The application though continued to throw logs similar to the one
above until the node streamed all the data and went from 'UJ' to 'UN'
state, as it appears in the output of nodetool status. After the node was
fully joined to the cluster there have not been similar logs. Not sure if
this is related or not, but we also noticed a schema disagreement in the
cluster while adding the new node:

new_node: 01f0eb0b-82d6-38de-b943-d4f31ca29b98
all other nodes: 2aa39f66-0f1a-3202-8c28-8469ebfdf622

We fixed this by restarting the new node after it had joined the cluster.
All nodes agree that the schema version is
01f0eb0b-82d6-38de-b943-d4f31ca29b98 (not sure why, I would expect the
new_node to agree with the rest).
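(For reference, schema agreement can be confirmed with a quick check like
the one below; a small sketch, run against any node:)

# All nodes should be listed under a single schema version; more than one
# version in the output means there is still a schema disagreement.
nodetool describecluster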

Initially we thought the issue was related to this:
https://issues.apache.org/jira/browse/CASSANDRA-833

but the more we read about it the more unrelated it feels, plus it appears
to be fixed in the version we are running.

We tried reproducing the issue on a local cluster but we were unable to do
so.

Shouldn't LOCAL_QUORUM require 2 local replicas instead of 3 during the
time the new node was joining the cluster? There are not 3 local replicas
anyway.

Thanks for any help.

Vasilis


Re: Best way to alert/monitor "nodetool status" down.

2015-03-08 Thread Vasileios Vlachos

We use Nagios for monitoring, and we call the following through NRPE:

#!/bin/bash

# Just for reference:
# Nodetool's output represents Status and State in this order.
# Status values: U (up), D (down)
# State values: N (normal), L (leaving), J (joining), M (moving)

NODETOOL=$(which nodetool);
NODES_DOWN=$(${NODETOOL} --host localhost status | grep --count -E '^D[A-Z]');

if [[ ${NODES_DOWN} -gt 0 ]]; then
    output="CRITICAL - Nodes down: ${NODES_DOWN}";
    return_code=2;
elif [[ ${NODES_DOWN} -eq 0 ]]; then
    output="OK - Nodes down: ${NODES_DOWN}";
    return_code=0;
else
    output="UNKNOWN - Couldn't retrieve cluster information.";
    return_code=3;
fi

echo "${output}";
exit ${return_code};

I've not used zabbix so I'm not sure the exit codes etc are the same for 
you. Also, you may need to modify the REGEX slightly depending on the 
Cassandra version you are using. There must be a way to get this via the 
JMX console as well, which might be easier for you to monitor.


On 07/03/15 00:37, Kevin Burton wrote:
What's the best way to monitor nodetool status being down? I.e. if a
specific server thinks a node is down (DN).


Does this just use JMX? Is there an API we can call?

We want to tie it into our zabbix server so we can detect if there is a
failure.


--
Founder/CEO Spinn3r.com http://Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile 
https://plus.google.com/102718274791889610666/posts

http://spinn3r.com


--
Kind Regards,

Vasileios Vlachos

IT Infrastructure Engineer
MSc Internet & Wireless Computing
BEng Electronics Engineering
Cisco Certified Network Associate (CCNA)



Re: Should one expect to see hints being stored/delivered occasionally?

2015-01-30 Thread Vasileios Vlachos

Thanks for your reply Rob, I am back to this after a while...

I am not sure if this is different in 1.2.18, but I remember from older
versions that GC pauses would only be logged in the system.log if
their duration was >= 200ms. Also, when hints are detected, we cannot
correlate it with GC pauses. We are thinking of tweaking the GC logging
settings in the cassandra-env file, but we are unsure as to which ones
are going to be heavy for the server and which ones are safer to modify.
Would you be able to advise on this?


The hints issue we seem to have is not catastrophic in the sense that
it is not causing serious/obvious problems to the clients, but it makes
us feel rather uncomfortable with the overall cluster health because, as
you said, it is a warning sign that something is wrong. It doesn't happen
very often either, but I don't think this makes the situation any
better. Apart from increasing the GC logging, I don't see any other way
of debugging this further.


Thanks for your input,

Vasilis

On 20/01/15 22:53, Robert Coli wrote:
On Sat, Jan 17, 2015 at 3:32 PM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:


Is there any other occasion that hints are stored and then being
sent in a cluster, other than network or other temporary or
permanent failure? Could it be that the client responsible for
establishing a connection is causing this? We use the Datastax C#
driver for connecting to the cluster and we run C* 1.2.18 on
Ubuntu 12.04.


Other than restarting nodes manually (which I consider a temporary 
failure for the purposes of this question), no. Seeing hints being 
stored and delivered outside of this context is a warning sign that 
something may be wrong with your cluster.


Probably what is happening is that you have stop the world GCs long 
enough to trigger queueing of hints via timeouts during these GCs.


=Rob


Should one expect to see hints being stored/delivered occasionally?

2015-01-17 Thread Vasileios Vlachos

Hello,

I thought hints are being stored on node_A every time node_B is
unavailable for whatever reason. I also thought that these hints are
being delivered from node_A to node_B when node_B is back and this
is true for a period <= max_hint_window_in_ms. After that hints are
dropped and therefore never delivered to node_B.


Obviously I am wrong, because occasionally we get alerted from our 
monitoring system that hints are being stored and delivered, which as 
far as I know indicates a problem. Now, when that happens I cannot 
correlate it with any network issues (all nodes are on the same LAN 
anyway) or other problems. The output from system.log looks like this:


INFO [CompactionExecutor:109085] 2015-01-17 15:35:13,536 
CompactionTask.java (line 262) Compacted 2 sstables to 
[/var/lib/cassandra/data/DataMining/quotebyquotereference/DataMining-quoteby
quotereference-ic-89765,].  222,905,570 bytes to 222,881,859 (~99% of 
original) in 91,850ms = 2.314172MB/s.  161,259 total rows, 161,253 
unique.  Row merge counts were {1:161247, 2:6, }
 INFO [CompactionExecutor:109090] 2015-01-17 15:35:13,537 
CompactionTask.java (line 105) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/DataMining/quotebyquotereference/DataMining-
quotebyquotereference-ic-89750-Data.db'), 
SSTableReader(path='/var/lib/cassandra/data/DataMining/quotebyquotereference/DataMining-quotebyquotereference-ic-89765-Data.db')]
 INFO [HintedHandoff:2] 2015-01-17 15:35:38,564 
HintedHandOffManager.java (line 294) Started hinted handoff for host: 
2ae2c679-8769-44da-a713-3bc21c670620 with IP: /10.3.5.3
 INFO [HintedHandoff:1] 2015-01-17 15:35:38,564 
HintedHandOffManager.java (line 294) Started hinted handoff for host: 
0bb63124-6333-43fa-b1c8-3a8f6627b85a with IP: /10.3.5.2
 INFO [HintedHandoff:1] 2015-01-17 15:35:38,967 
HintedHandOffManager.java (line 326) Finished hinted handoff of 17 rows 
to endpoint /10.3.5.2
 INFO [HintedHandoff:1] 2015-01-17 15:35:38,968 ColumnFamilyStore.java 
(line 633) Enqueuing flush of Memtable-hints@1779218028(614406/2848765 
serialized/live bytes, 220 ops)
 INFO [FlushWriter:9360] 2015-01-17 15:35:38,969 Memtable.java (line 
398) Writing Memtable-hints@1779218028(614406/2848765 serialized/live 
bytes, 220 ops)
 INFO [FlushWriter:9360] 2015-01-17 15:35:39,192 Memtable.java (line 
436) Completed flushing 
/var/lib/cassandra/data/system/hints/system-hints-ic-89-Data.db (176861 
bytes) for commitlog position ReplayPosition(segmentId=1418136927153, 
position=20201767)
 INFO [CompactionExecutor:109094] 2015-01-17 15:35:39,194 
CompactionTask.java (line 105) Compacting 
[SSTableReader(path='/var/lib/cassandra/data/system/hints/system-hints-ic-89-Data.db')]
 INFO [CompactionExecutor:109094] 2015-01-17 15:35:39,485 
CompactionTask.java (line 262) Compacted 1 sstables to 
[/var/lib/cassandra/data/system/hints/system-hints-ic-90,]. 176,861 
bytes to 177,355 (~100% of original) in 290ms = 0.583238MB/s.  4 total 
rows, 3 unique.  Row merge counts were {1:4, }
 INFO [HintedHandoff:1] 2015-01-17 15:35:39,485 
HintedHandOffManager.java (line 294) Started hinted handoff for host: 
6b99058f-ba48-42b9-baa1-a878a74338cc with IP: /10.3.5.1
 INFO [HintedHandoff:1] 2015-01-17 15:35:40,084 
HintedHandOffManager.java (line 326) Finished hinted handoff of 22 rows 
to endpoint /10.3.5.1
 INFO [HintedHandoff:1] 2015-01-17 15:35:40,085 ColumnFamilyStore.java 
(line 633) Enqueuing flush of Memtable-hints@1204004752(2356/10923 
serialized/live bytes, 62 ops)
 INFO [FlushWriter:9360] 2015-01-17 15:35:40,085 Memtable.java (line 
398) Writing Memtable-hints@1204004752(2356/10923 serialized/live bytes, 
62 ops)


Is there any other occasion that hints are stored and then being sent in 
a cluster, other than network or other temporary or permanent failure? 
Could it be that the client responsible for establishing a connection is 
causing this? We use the Datastax C# driver for connecting to the 
cluster and we run C* 1.2.18 on Ubuntu 12.04.


Many thanks,

Vasilis


Re: Node stuck during nodetool rebuild

2014-08-06 Thread Vasileios Vlachos
Hello Mark and Rob,

Thank you very much for your input, I will increase the phi threshold and
report back any progress.

Vasilis
On 5 Aug 2014 21:52, Mark Reddy mark.re...@boxever.com wrote:

 Hi Vasilis,

 To further on what Rob said

 I believe you might be able to tune the phi detector threshold to help
 this operation complete, hopefully someone with direct experience of same
 will chime in.


 I have been through this operation where streams break due to a node
 falsely being marked down (flapping). In an attempt to mitigate this I
 increased the phi_convict_threshold in cassandra.yaml from 8 to 10, after
 which the rebuild was able to successfully complete. The default value for
 phi_convict_threshold is 8 with 12 being the maximum recommended value.


 Mark


 On Tue, Aug 5, 2014 at 7:22 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Aug 5, 2014 at 1:28 AM, Vasileios Vlachos 
 vasileiosvlac...@gmail.com wrote:

 The problem is that the nodetool seems to be stuck, and nodetool
 netstats on node1 of DC2 appears to be stuck at 10% streaming a 5G file
 from node2 at DC1. This doesn't tally with nodetool netstats when running
 it against either of the DC1 nodes. The DC1 nodes don't think they stream
 anything to DC2.


 Yes, streaming is fragile and breaks and hangs forever and your only
 option in most cases is to stop the rebuilding node, nuke its data, and
 start again.

 I believe you might be able to tune the phi detector threshold to help
 this operation complete, hopefully someone with direct experience of same
 will chime in.

 =Rob






Re: Node stuck during nodetool rebuild

2014-08-06 Thread Vasileios Vlachos
Actually something else I would like to ask... Do you know if phi is
related to streaming_socket_timeout_in_ms? It seems to be set to infinity
by default. Could that be related to the hang behaviour of rebuild? Would
you recommend changing the default or I have completely misinterpreted its
meaning?

Many thanks,

Vasilis
On 5 Aug 2014 21:52, Mark Reddy mark.re...@boxever.com wrote:

 Hi Vasilis,

 To further on what Rob said

 I believe you might be able to tune the phi detector threshold to help
 this operation complete, hopefully someone with direct experience of same
 will chime in.


 I have been through this operation where streams break due to a node
 falsely being marked down (flapping). In an attempt to mitigate this I
 increased the phi_convict_threshold in cassandra.yaml from 8 to 10, after
 which the rebuild was able to successfully complete. The default value for
 phi_convict_threshold is 8 with 12 being the maximum recommended value.


 Mark


 On Tue, Aug 5, 2014 at 7:22 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Aug 5, 2014 at 1:28 AM, Vasileios Vlachos 
 vasileiosvlac...@gmail.com wrote:

 The problem is that the nodetool seems to be stuck, and nodetool
 netstats on node1 of DC2 appears to be stuck at 10% streaming a 5G file
 from node2 at DC1. This doesn't tally with nodetool netstats when running
 it against either of the DC1 nodes. The DC1 nodes don't think they stream
 anything to DC2.


 Yes, streaming is fragile and breaks and hangs forever and your only
 option in most cases is to stop the rebuilding node, nuke its data, and
 start again.

 I believe you might be able to tune the phi detector threshold to help
 this operation complete, hopefully someone with direct experience of same
 will chime in.

 =Rob






Node stuck during nodetool rebuild

2014-08-05 Thread Vasileios Vlachos
Hello All,

We are on 1.2.18 (running on Ubuntu 12.04) and we recently tried to add a
second DC on our demo environment, just before trying it on live. The
existing DC1 has two nodes which approximately hold 10G of data (RF=2). In
order to add the second DC, DC2, we followed this procedure:

On DC1 nodes:
1. Changed the Snitch in the cassandra.yaml from default to
GossipingPropertyFileSnitch.
2. Configured the cassandra-rackdc.properties (DC1, RAC1).
3. Rolling restart
4. Update replication strategy for each keyspace, for example: ALTER
KEYSPACE keyspace WITH REPLICATION =
{'class':'NetworkTopologyStrategy','DC1':2};

On DC2 nodes:
5. Edit the cassandra.yaml with: auto_bootstrap: false, seeds (one IP from
DC1), cluster name to match whatever we have on DC1 nodes, correct IP
settings, num_tokens, initial_token left unset and finally the snitch
(GossipingPropertyFileSnitch, as in DC1).
6. Changed the cassandra-rackdc.properties (DC2, RAC1)

On the Application:
7. Changed the C# DataStax driver load balancing policy to be
DCAwareRoundRobinPolicy
8. Changed the application consistency level from QUORUM to LOCAL_QUORUM
9. After deleting the data, commitlog and saved_caches directory we started
cassandra on both nodes in the new DC, DC2. According to the logs at this
point all nodes were able to see all other nodes with the correct/expected
output when running nodetool status.

On DC1 nodes:
10. After cassandra was running on DC2, we changed the Keyspace RF to
include the new DC as follows:  ALTER KEYSPACE keyspace WITH REPLICATION
= {'class':'NetworkTopologyStrategy','DC1':2, 'DC2':2};
11. As a last step and in order to stream the data across to the second DC,
we run this on node1 of DC2: nodetool rebuild DC1. After the successful
completion of this, we were planning to run the same on node2 of DC2.

The problem is that the nodetool seems to be stuck, and nodetool netstats
on node1 of DC2 appears to be stuck at 10% streaming a 5G file from node2
at DC1. This doesn't tally with nodetool netstats when running it against
either of the DC1 nodes. The DC1 nodes don't think they stream anything to
DC2.

It is worth pointing out that initially we tried to run 'nodetool rebuild DC1'
on both nodes at DC2, given the small amount of data to be streamed in
total (approximately 10G as I explained above). We experienced the same
problem, with the only difference being that 'nodetool rebuild DC1' stuck
on both nodes at DC2 very soon after running it, whereas now it happened
only after running it for an hour or so. We thought the problem was that we
tried to run nodetool against both nodes at the same time. So, we tried
running it only against node 1 after we deleted all the data, commitlog and
caches on both nodes and started from step (9) again. Now nodetool rebuild
is running against node1 at DC2 for more than 12 hours with no luck... The
weird thing is that the cassandra logs appear to be clean and the VPN
between the two DCs has no problems at all.

Any thoughts? Have we missed something in the steps I described? Is
anything wrong in the procedure? Any help would be much appreciated.

Thanks,

Vasilis


Re: Multi-DC Environment Question

2014-06-16 Thread Vasileios Vlachos
Hello again,

Back to this after a while...

As far as I can tell whenever DC2 is unavailable, there is one node from
DC1 that acts as a coordinator. When DC2 is available again, this one node
sends the hints to only one node at DC2, which then sends any replicas to
the other nodes in the local DC (DC2). This ensures efficient cross-DC
bandwidth usage. I was watching system.hints on all nodes during this
test and this is the conclusion I came to.

Two things:
1. If the above is correct, does the same apply when performing
anti-entropy repair (without specifying a particular DC)? I'm just hoping
the answer to this is going to be YES, otherwise the VPN is not going to be
very happy in our case and we would prefer to not saturate it whenever
running nodetool repair. I suppose we could have a traffic limiter on the
firewalls worst case scenario but I would appreciate your input if you know
more on this.

2. As I described earlier, in order to test this I was watching the
system.hints CF in order to monitor any hints. I was looking to add a
Nagios check for this purpose. For that reason I was looking into JMX
Console. I noticed that when a node stores hints, MBean
org.apache.cassandra.db:type=ColumnFamilies,keyspace=system,columnfamily=hints,
attribute MemtableColumnsCount goes up (although I would expect it to be
MemtableRowCount or something?). This attribute will retain its value,
until the other node becomes available and ready to receive the hints. I
was looking for another attribute somewhere to monitor the active hints. I
checked:

MBean
org.apache.cassandra.metrics:type=ColumnFamily,keyspace=system,scope=hints,name=PendingTasks,

MBean org.apache.cassandra.metrics:type=Storage,name=TotalHints,
MBean
org.apache.cassandra.metrics:type=Storage,name=TotalHintsInProgress,
MBean
org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=HintedHandoff,name=ActiveTasks
and even
MBean
org.apache.cassandra.metrics:type=HintedHandOffManager,name=Hints_not_stored-/
10.2.1.100 (this one will never go back to zero).

All of them would not increase whenever any hints are being sent (or at
least I didn't catch it because it was too fast or whatever?). Does anyone
know what all these attributes represent? It looks like there are more
specific hint attributes on a per CF basis, but I was looking for a more
generic one to begin with. Any help would be much appreciated.
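In the meantime we are considering a cruder check based on nodetool rather
than JMX (a sketch only; it assumes the HintedHandoff pool appears in the
tpstats output of this version and that Active and Pending are the second
and third columns):

#!/bin/bash
# Alert if the HintedHandoff thread pool reports active or pending tasks.
ACTIVE_PENDING=$(nodetool tpstats | awk '/^HintedHandoff/ {print $2 + $3}');

if [[ -z "${ACTIVE_PENDING}" ]]; then
    echo "UNKNOWN - could not read the HintedHandoff pool from tpstats";
    exit 3;
elif [[ "${ACTIVE_PENDING}" -gt 0 ]]; then
    echo "WARNING - HintedHandoff tasks active/pending: ${ACTIVE_PENDING}";
    exit 1;
else
    echo "OK - no active or pending hint deliveries";
    exit 0;
fi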

Thanks in advance,

Vasilis


On Wed, Jun 4, 2014 at 1:42 PM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:

 Hello Matt,

 nodetool status:

 Datacenter: MAN
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 -- Address Load Owns (effective) Host ID Token Rack
 UN 10.2.1.103 89.34 KB 99.2% b7f8bc93-bf39-475c-a251-8fbe2c7f7239
 -9211685935328163899 RAC1
 UN 10.2.1.102 86.32 KB 0.7% 1f8937e1-9ecb-4e59-896e-6d6ac42dc16d
 -3511707179720619260 RAC1
 Datacenter: DER
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 -- Address Load Owns (effective) Host ID Token Rack
 UN 10.2.1.101 75.43 KB 0.2% e71c7ee7-d852-4819-81c0-e993ca87dd5c
 -1277931707251349874 RAC1
 UN 10.2.1.100 104.53 KB 99.8% 7333b664-ce2d-40cf-986f-d4b4d4023726
 -9204412570946850701 RAC1

 I do not know why the cluster is not balanced at the moment, but it holds
 almost no data. I will populate it soon and see how that goes. The output
 of 'nodetool ring' just lists all the tokens assigned to each individual
 node, and as you can imagine it would be pointless to paste it here. I just
 did 'nodetool ring | awk ... | unique | wc -l' and it works out to be 1024
 as expected (4 nodes x 256 tokens each).

 Still have not got the answers to the other questions though...

 Thanks,

 Vasilis


 On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com
 wrote:

 Thanks Vasileios.  I think I need to make a call as to whether to switch
 to vnodes or stick with tokens for my Multi-DC cluster.

 Would you be able to show a nodetool ring/status from your cluster to see
 what the token assignment looks like ?

 Thanks

 Matt


 On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos 
 vasileiosvlac...@gmail.com wrote:

  I should have said that earlier really... I am using 1.2.16 and Vnodes
 are enabled.

 Thanks,

 Vasilis

 --
 Kind Regards,

 Vasileios Vlachos






Re: Multi-DC Environment Question

2014-06-04 Thread Vasileios Vlachos
Hello Matt,

nodetool status:

Datacenter: MAN
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.103 89.34 KB 99.2% b7f8bc93-bf39-475c-a251-8fbe2c7f7239
-9211685935328163899 RAC1
UN 10.2.1.102 86.32 KB 0.7% 1f8937e1-9ecb-4e59-896e-6d6ac42dc16d
-3511707179720619260 RAC1
Datacenter: DER
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns (effective) Host ID Token Rack
UN 10.2.1.101 75.43 KB 0.2% e71c7ee7-d852-4819-81c0-e993ca87dd5c
-1277931707251349874 RAC1
UN 10.2.1.100 104.53 KB 99.8% 7333b664-ce2d-40cf-986f-d4b4d4023726
-9204412570946850701 RAC1

I do not know why the cluster is not balanced at the moment, but it holds
almost no data. I will populate it soon and see how that goes. The output
of 'nodetool ring' just lists all the tokens assigned to each individual
node, and as you can imagine it would be pointless to paste it here. I just
did 'nodetool ring | awk ... | unique | wc -l' and it works out to be 1024
as expected (4 nodes x 256 tokens each).

Still have not got the answers to the other questions though...

Thanks,

Vasilis


On Wed, Jun 4, 2014 at 12:28 AM, Matthew Allen matthew.j.al...@gmail.com
wrote:

 Thanks Vasileios.  I think I need to make a call as to whether to switch
 to vnodes or stick with tokens for my Multi-DC cluster.

 Would you be able to show a nodetool ring/status from your cluster to see
 what the token assignment looks like ?

 Thanks

 Matt


 On Wed, Jun 4, 2014 at 8:31 AM, Vasileios Vlachos 
 vasileiosvlac...@gmail.com wrote:

  I should have said that earlier really... I am using 1.2.16 and Vnodes
 are enabled.

 Thanks,

 Vasilis

 --
 Kind Regards,

 Vasileios Vlachos





Re: Multi-DC Environment Question

2014-06-03 Thread Vasileios Vlachos

Thanks for your responses!

Matt, I did a test with 4 nodes, 2 in each DC and the answer appears to 
be yes. The tokens seem to be unique across the entire cluster, not just 
on a per DC basis. I don't know if the number of nodes deployed is 
enough to reassure me, but this is my conclusion for now. Please, 
correct me if you know I'm wrong.


Rob, this is the plan of attack I have in mind now. Although, in case of 
a catastrophic failure of a DC, the downtime is usually longer than 
that. So it's either less than the default value (when testing that the 
DR works for example) or more (actually using the DR as primary DC). 
Based on that, the default seems reasonable to me.


I also found that nodetool repair can be performed on one DC only by 
specifying the --in-local-dc option. So, presumably the classic nodetool 
repair applies to the entire cluster (sounds obvious, but is that 
actually correct?).
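In other words, something along these lines is what we have in mind (a
sketch; run on each node of the local DC in turn, and the exact option
spelling may differ between versions):

# Repair only against replicas in the local datacenter:
nodetool repair --in-local-dc

# Whereas the plain form repairs against replicas in all datacenters:
nodetool repair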


Question 3 in my previous email still remains unanswered to me... I 
cannot find out if there is only one hint stored in the coordinator 
irrespective of number of replicas being down, and also if the hint is 
100% of the size of the original write request.


Thanks,

Vasilis

On 03/06/14 18:52, Robert Coli wrote:
On Fri, May 30, 2014 at 4:08 AM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:


Basically you sort of confirmed that if down_time >
max_hint_window_in_ms the only way to bring DC1 up-to-date is
anti-entropy repair.


Also, read repair does not help either as we assumed that
down_time > max_hint_window_in_ms. Please correct me if I am wrong.


My understanding is that if you :

1) set read repair chance to 100%
2) read all keys in the keyspace with a client

You would accomplish the same increase in consistency as you would by 
running repair.
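(A sketch of step 1, assuming CQL3 and using ks.cf as a placeholder
keyspace/table name:)

# Raise the read repair chance to 100% for the table in question:
echo "ALTER TABLE ks.cf WITH read_repair_chance = 1.0;" | cqlsh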


In cases where this may matter, and your system can handle delivering 
the hints, increasing the already-increased-from-old-default-of-1-hour 
current default of 3 hours to 6 or more hours gives operators more 
time to work in the case of partition or failure. Note that hints are 
only an optimization, only repair (and read repair at 100%, I think..) 
assert any guarantee of consistency.


=Rob



--
Kind Regards,

Vasileios Vlachos



Re: Multi-DC Environment Question

2014-06-03 Thread Vasileios Vlachos
I should have said that earlier really... I am using 1.2.16 and Vnodes 
are enabled.


Thanks,

Vasilis

--
Kind Regards,

Vasileios Vlachos



Re: Multi-DC Environment Question

2014-05-30 Thread Vasileios Vlachos
Thanks for your responses, Ben thanks for the link.

Basically you sort of confirmed that if down_time > max_hint_window_in_ms
the only way to bring DC1 up-to-date is anti-entropy repair. Read
consistency level is irrelevant to the problem I described as I am reading
LOCAL_QUORUM. In this situation I lost whatever data -if any- had not been
transfered across to DC2 before DC1 went down, that is understandable.
Also, read repair does not help either as we assumed that down_time >
max_hint_window_in_ms. Please correct me if I am wrong.

I think I could better understand how that works if I knew the answers to
the following questions:
1. What is the output of nodetool status when a cluster spans across 2 DCs?
Will I be able to see ALL nodes irrespective of the DC they belong to?
2. How are tokens assigned when adding a 2nd DC? Is the range -2^63
to 2^63-1 for each DC, or is it -2^63 to 2^63-1 for the entire cluster? (I
think the latter is correct)
3. Does the coordinator store 1 hint irrespective of how many replicas
happen to be down at the time and also irrespective of DC2 being down in
the scenario I described above? (I think the answer is according to the
presentation you sent me, but I would like someone to confirm that)

Thank you in advance,

Vasilis


On Fri, May 30, 2014 at 3:13 AM, Ben Bromhead b...@instaclustr.com wrote:

 Short answer:

  If time elapsed > max_hint_window_in_ms then hints will stop being
 created. You will need to rely on your read consistency level, read repair
 and anti-entropy repair operations to restore consistency.

 Long answer:

 http://www.slideshare.net/jasedbrown/understanding-antientropy-in-cassandra

 Ben Bromhead
 Instaclustr | www.instaclustr.com | @instaclustr
 http://twitter.com/instaclustr | +61 415 936 359

 On 30 May 2014, at 8:40 am, Tupshin Harper tups...@tupshin.com wrote:

 When one node or DC is down, coordinator nodes being written through will
 notice this fact and store hints (hinted handoff is the mechanism),  and
 those hints are used to send the data that was not able to be replicated
 initially.

 http://www.datastax.com/dev/blog/modern-hinted-handoff

 -Tupshin
 On May 29, 2014 6:22 PM, Vasileios Vlachos vasileiosvlac...@gmail.com
 wrote:

  Hello All,

 We have plans to add a second DC to our live Cassandra environment.
 Currently RF=3 and we read and write at QUORUM. After adding DC2 we are
 going to be reading and writing at LOCAL_QUORUM.

 If my understanding is correct, when a client sends a write request, if
 the consistency level is satisfied on DC1 (that is RF/2+1), success is
 returned to the client and DC2 will eventually get the data as well. The
  assumption behind this is that the client always connects to DC1 for
 reads and writes and given that there is a site-to-site VPN between DC1 and
 DC2. Therefore, DC1 will almost always return success before DC2 (actually
 I don't know if it is possible for DC2 to be more up-to-date than DC1 with
 this setup...).

  Now imagine DC1 loses connectivity and the client fails over to DC2.
 Everything should work fine after that, with the only difference that DC2
 will be now handling the requests directly from the client. After some
 time, say after max_hint_window_in_ms, DC1 comes back up. My question is
 how do I bring DC1 up to speed with DC2 which is now more up-to-date? Will
 that require a nodetool repair on DC1 nodes? Also, what is the answer
  when the outage is > max_hint_window_in_ms instead?

 Thanks in advance!

 Vasilis

 --
 Kind Regards,

 Vasileios Vlachos





Multi-DC Environment Question

2014-05-29 Thread Vasileios Vlachos

Hello All,

We have plans to add a second DC to our live Cassandra environment. 
Currently RF=3 and we read and write at QUORUM. After adding DC2 we are 
going to be reading and writing at LOCAL_QUORUM.


If my understanding is correct, when a client sends a write request, if 
the consistency level is satisfied on DC1 (that is RF/2+1), success is 
returned to the client and DC2 will eventually get the data as well. The 
assumption behind this is that the client always connects to DC1 for
reads and writes and given that there is a site-to-site VPN between DC1 
and DC2. Therefore, DC1 will almost always return success before DC2 
(actually I don't know if it is possible for DC2 to be more up-to-date 
than DC1 with this setup...).


Now imagine DC1 loses connectivity and the client fails over to DC2.
Everything should work fine after that, with the only difference that 
DC2 will be now handling the requests directly from the client. After 
some time, say after max_hint_window_in_ms, DC1 comes back up. My 
question is how do I bring DC1 up to speed with DC2 which is now more 
up-to-date? Will that require a nodetool repair on DC1 nodes? Also, what 
is the answer when the outage is > max_hint_window_in_ms instead?


Thanks in advance!

Vasilis

--
Kind Regards,

Vasileios Vlachos



Re: Adding datacenter for move to vnodes

2014-02-07 Thread Vasileios Vlachos
Thanks for you input.

Yes, you can mix Vnode-enabled and Vnode-disabled nodes. What you described
is exactly what happened. We had a node which was responsible for 90%+ of
the load. What is the actual result of this though?

Say you have 6 nodes with 300G each. So you decommission N1 and you bring
it back in with Vnodes. Is that going to stream back 90%+ of the 300Gx6, or
it eventually will hold the 90%+ of all the data stored into your cluster?
If the second is what actually happens, this process should be safe on a
live cluster as well, given that you are going to upgrade the other 5 nodes
straight after...

Any thoughts?

Thanks,

Bill
On 7 Feb 2014 12:58, Alain RODRIGUEZ arodr...@gmail.com wrote:

 @Bill

 Another DC for this migration is the least impacting way to do it. You set
 a new cluster, switch to it when it's ready. No performance or down time
 issues.

 Decommissioning a node is quite a heavy operation since it will give part
 of its data to all the remaining nodes, increasing network, disk load and
 data size on all the remaining nodes.

 Another option is cassandra-shuffle, but afaik, it never worked
 properly and people recommend using a new cluster to switch.

 @Andrey  Bill

 I think you can mix vnodes with physical nodes, yet, you might have a node
 with 99% of the data, since it will take care of a lot of ranges (256 ?)
 while other nodes will take care of only 1. Might not be an issue on a dev
 or demo cluster but it will certainly be in a production environment.




 2014-02-07 0:28 GMT+01:00 Andrey Ilinykh ailin...@gmail.com:

 My understanding is you can't mix vnodes and regular nodes in the same
 DC. Is it correct?



 On Thu, Feb 6, 2014 at 2:16 PM, Vasileios Vlachos 
 vasileiosvlac...@gmail.com wrote:

 Hello,

 My question is why would you need another DC to migrate to Vnodes? How
 about decommissioning each node in turn, changing the cassandra.yaml
 accordingly, delete the data and bring the node back in the cluster and let
 it bootstrap from the others?

 We did that recently with our demo cluster. Is that wrong in any way?
  The only thing to take into consideration is the disk space I think. We are
 not using amazon, but I am not sure how would that be different for this
 particular issue.

 Thanks,

 Bill
 On 6 Feb 2014 16:34, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Glad it helps.

 Good luck with this.

 Cheers,

 Alain


 2014-02-06 17:30 GMT+01:00 Katriel Traum katr...@google.com:

 Thank you Alain! That was exactly what I was looking for. I was
 worried I'd have to do a rolling restart to change the snitch.

 Katriel



  On Thu, Feb 6, 2014 at 1:10 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi, we did this exact same operation here too, with no issue.

 Contrary to Paulo we did not modify our snitch.

  We simply added a dc_suffix property in the
  cassandra-rackdc.properties conf file for nodes in the new cluster:

 # Add a suffix to a datacenter name. Used by the Ec2Snitch and
 Ec2MultiRegionSnitch

 # to append a string to the EC2 region name.

 dc_suffix=-xl

 So our new cluster DC is basically : eu-west-xl

 I think this is less risky, at least it is easier to do.

 Hope this help.


 2014-02-02 11:42 GMT+01:00 Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com:

 We had a similar situation and what we did was first migrate the 1.1
 cluster to GossipingPropertyFileSnitch, making sure that for each node 
 we
 specified the correct availability zone as the rack in
 the cassandra-rackdc.properties. In this way,
 the GossipingPropertyFileSnitch is equivalent to the 
 EC2MultiRegionSnitch,
 so the data location does not change and no repair is needed afterwards.
 So, if your nodes are located in the us-east-1e AZ, your 
 cassandra-rackdc.properties
 should look like:

 dc=us-east
 rack=1e

 After this step is complete on all nodes, then you can add a new
 datacenter specifying different dc and rack on the
 cassandra-rackdc.properties of the new DC. Make sure you upgrade your
 initial datacenter to 1.2 before adding a new datacenter with vnodes
 enabled (of course).

 Cheers


  On Sun, Feb 2, 2014 at 6:37 AM, Katriel Traum katr...@google.com wrote:

 Hello list.

 I'm upgrading a 1.1 cassandra cluster to 1.2(.13).
 I've read here and in other places that the best way to migrate to
 vnodes is to add a new DC, with the same amount of nodes, and run 
 rebuild
 on each of them.
 However, I'm faced with the fact that I'm using EC2MultiRegion
 snitch, which automagically creates the DC and RACK.

 Any ideas how I can go about adding a new DC with this kind of
 setup? I need these new machines to be in the same EC2 Region as the
 current ones, so adding to a new Region is not an option.

 TIA,
 Katriel




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200
 +55 83 9690-1314









Re: Adding datacenter for move to vnodes

2014-02-06 Thread Vasileios Vlachos
Hello,

My question is why would you need another DC to migrate to Vnodes? How
about decommissioning each node in turn, changing the cassandra.yaml
accordingly, delete the data and bring the node back in the cluster and let
it bootstrap from the others?

We did that recently with our demo cluster. Is that wrong in any way? The
only thing to take into consideration is the disk space I think. We are not
using amazon, but I am not sure how would that be different for this
particular issue.

Thanks,

Bill
On 6 Feb 2014 16:34, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Glad it helps.

 Good luck with this.

 Cheers,

 Alain


 2014-02-06 17:30 GMT+01:00 Katriel Traum katr...@google.com:

 Thank you Alain! That was exactly what I was looking for. I was worried
 I'd have to do a rolling restart to change the snitch.

 Katriel



  On Thu, Feb 6, 2014 at 1:10 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi, we did this exact same operation here too, with no issue.

 Contrary to Paulo we did not modify our snitch.

  We simply added a dc_suffix property in the
  cassandra-rackdc.properties conf file for nodes in the new cluster:

 # Add a suffix to a datacenter name. Used by the Ec2Snitch and
 Ec2MultiRegionSnitch

 # to append a string to the EC2 region name.

 dc_suffix=-xl

 So our new cluster DC is basically : eu-west-xl

 I think this is less risky, at least it is easier to do.

 Hope this help.


 2014-02-02 11:42 GMT+01:00 Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com:

 We had a similar situation and what we did was first migrate the 1.1
 cluster to GossipingPropertyFileSnitch, making sure that for each node we
 specified the correct availability zone as the rack in
 the cassandra-rackdc.properties. In this way,
 the GossipingPropertyFileSnitch is equivalent to the EC2MultiRegionSnitch,
 so the data location does not change and no repair is needed afterwards.
 So, if your nodes are located in the us-east-1e AZ, your 
 cassandra-rackdc.properties
 should look like:

 dc=us-east
 rack=1e

 After this step is complete on all nodes, then you can add a new
 datacenter specifying different dc and rack on the
 cassandra-rackdc.properties of the new DC. Make sure you upgrade your
 initial datacenter to 1.2 before adding a new datacenter with vnodes
 enabled (of course).

 Cheers


  On Sun, Feb 2, 2014 at 6:37 AM, Katriel Traum katr...@google.com wrote:

 Hello list.

 I'm upgrading a 1.1 cassandra cluster to 1.2(.13).
 I've read here and in other places that the best way to migrate to
 vnodes is to add a new DC, with the same amount of nodes, and run rebuild
 on each of them.
 However, I'm faced with the fact that I'm using EC2MultiRegion snitch,
 which automagically creates the DC and RACK.

 Any ideas how I can go about adding a new DC with this kind of setup?
 I need these new machines to be in the same EC2 Region as the current 
 ones,
 so adding to a new Region is not an option.

 TIA,
 Katriel




 --
 *Paulo Motta*

 Chaordic | *Platform*
 *www.chaordic.com.br http://www.chaordic.com.br/*
 +55 48 3232.3200
 +55 83 9690-1314







Unreachable Nodes

2013-05-22 Thread Vasileios Vlachos
Hello All,

A while ago we had 3 cassandra nodes on Amazon. At some point we decided to
buy some servers and deploy cassandra there. The problem is that since then
we have a list of dead IPs listed as UNREACHABLE nodes when we run describe
cluster on cassandra-cli.

I have seen other posts which describe similar issues, and the bottom line
is it's harmless but if you want to get rid of it do a full cluster
restart (I presume that means a rolling restart - not shut-down the entire
cluster right???). Anyway...

We also came across another solution: Install libmx4j-java, uncomment the
respective line on /etc/default/cassandra, restart the node, go to 
http://cassandra_node:8081/mbean?objectname=org.apache.cassandra.net%3Atype%3DGossiper;,
type in the dead IP/IPs next to the unsafeAssassinateEndpoint and invoke
it. So we did that on one of the nodes for the list of dead IPs. After
running describe cluster on the CLI on every node, we noticed that there
were no UNREACHABLE nodes and everything looked OK.
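For anyone without the mx4j web interface, roughly the same thing should
be possible from the command line with a generic JMX client such as
jmxterm (a sketch only; the jar name, flags and JMX port are assumptions
to adapt to your own setup):

# Placeholder dead-node IP and jmxterm jar path - adjust both.
DEAD_IP=10.128.16.110
echo "run -b org.apache.cassandra.net:type=Gossiper unsafeAssassinateEndpoint ${DEAD_IP}" \
    | java -jar jmxterm-uber.jar -l localhost:7199 -n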

However, when we run nodetool gossipinfo we get the following output:

/10.1.32.97
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.76851457173E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,56713727820156410577229101238628035243
/10.128.16.111
REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
/10.128.16.110
REMOVAL_COORDINATOR:REMOVER,1
STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
/10.1.32.100
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.75649392881E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,85070591730234615865843651857942052863
/10.1.32.101
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.71158702006E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,141784319550391026443072753096570088105
/10.1.32.98
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.73163150773E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,113427455640312821154458202477256070486
/10.128.16.112
REMOVAL_COORDINATOR:REMOVER,1
STATUS:LEFT,42537092606577173116506557155915918934,1369471567719
/10.1.32.99
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.72271268395E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,28356863910078205288614550619314017621
/10.1.32.96
RELEASE_VERSION:1.0.11
SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
LOAD:2.71494331357E11
RPC_ADDRESS:0.0.0.0
STATUS:NORMAL,0

Does anyone know why the dead nodes still appear when we run nodetool
gossipinfo but they don't when we run describe cluster from the CLI?

Thank you in advance for your help,

Vasilis


Re: Unreachable Nodes

2013-05-22 Thread Vasileios Vlachos
Hello,

Thanks for your fast response. That makes sense. I'll just keep an eye on
it then.

Many thanks,

Vasilis


On Wed, May 22, 2013 at 10:54 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi.

 I think that the unsafeAssassinateEndpoint was the good solution here. I
 was going to lead you to this solution after reading the first part of your
 message.

 Does anyone know why the dead nodes still appear when we run nodetool
 gossipinfo but they don't when we run describe cluster from the CLI?

 That's a good thing. Gossiper just keeps this information for a while (7 or
 10 days by default, off the top of my head), but this doesn't harm your
 cluster in any way, whereas having UNREACHABLE nodes could have been
 annoying. By the way gossipinfo shows you those nodes as STATUS:LEFT
 which is good. I am quite sure that this status changed when you used the
 jmx unsafeAssassinateEndpoint.

 do a full cluster restart (I presume that means a rolling restart - not
 shut-down the entire cluster right???). 

 A full restart = entire cluster down = down time. It is precisely *not*
 a rolling restart.

 To conclude I would say that your cluster seems healthy now (from what I
 can see), you have no more ghost nodes and nothing to do. Just wait a week
 or so and look for gossipinfo again.


 2013/5/22 Vasileios Vlachos vasileiosvlac...@gmail.com

 Hello All,

 A while ago we had 3 cassandra nodes on Amazon. At some point we decided
 to buy some servers and deploy cassandra there. The problem is that since
 then we have a list of dead IPs listed as UNREACHABLE nodes when we run
 describe cluster on cassandra-cli.

 I have seen other posts which describe similar issues, and the bottom
 line is it's harmless but if you want to get rid of it do a full cluster
 restart (I presume that means a rolling restart - not shut-down the entire
 cluster right???). Anyway...

 We also came across another solution: Install libmx4j-java, uncomment
 the respective line on /etc/default/cassandra, restart the node, go to 
 http://cassandra_node:8081/mbean?objectname=org.apache.cassandra.net%3Atype%3DGossiper;,
 type in the dead IP/IPs next to the unsafeAssassinateEndpoint and invoke
 it. So we did that on one of the nodes for the list of dead IPs. After
 running describe cluster on the CLI on every node, we noticed that there
 were no UNREACHABLE nodes and everything looked OK.

 However, when we run nodetool gossipinfo we get the following output:

 /10.1.32.97
  RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.76851457173E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,56713727820156410577229101238628035243
 /10.128.16.111
 REMOVAL_COORDINATOR:REMOVER,113427455640312821154458202477256070486
 STATUS:LEFT,42537039300520238181471502256297362072,1369471488145
 /10.128.16.110
 REMOVAL_COORDINATOR:REMOVER,1
 STATUS:LEFT,42537092606577173116506557155915918934,1369471275829
 /10.1.32.100
 RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.75649392881E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,85070591730234615865843651857942052863
 /10.1.32.101
 RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.71158702006E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,141784319550391026443072753096570088105
 /10.1.32.98
 RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.73163150773E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,113427455640312821154458202477256070486
 /10.128.16.112
 REMOVAL_COORDINATOR:REMOVER,1
 STATUS:LEFT,42537092606577173116506557155915918934,1369471567719
 /10.1.32.99
 RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.72271268395E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,28356863910078205288614550619314017621
 /10.1.32.96
 RELEASE_VERSION:1.0.11
 SCHEMA:b1116df0-b3dd-11e2--16fe4da5dbff
 LOAD:2.71494331357E11
 RPC_ADDRESS:0.0.0.0
 STATUS:NORMAL,0

 Does anyone know why the dead nodes still appear when we run nodetool
 gossipinfo but they don't when we run describe cluster from the CLI?

 Thank you in advance for your help,

 Vasilis





Re: Replication Factor and Consistency Level Confusion

2012-12-20 Thread Vasileios Vlachos
Hello,

Thank you very much for your quick responses.

Initially we were thinking the same thing, that an explanation would
be that the wrong node could be down, but then isn't this something
that hinted handoff sorts out? So actually, Consistency Level refers
to the number of replicas, not the total number of nodes in a cluster.
Keeping that in mind and assuming that hinted handoff has nothing to
do with that as I thought, I could explain some results but not all.
Let me explain:

Test 1 (3/3 Nodes UP):
CL  : ANY  ONE  TWO  THREE  QUORUM  ALL
RF 3: OK   OK   OK   OK     OK      OK

Test 2 (2/3 Nodes UP):
CL  : ANY  ONE  TWO  THREE  QUORUM  ALL
RF 2: OK   OK   x    x      OK      x

Test 3 (2/3 Nodes UP):
CL  : ANY  ONE  TWO  THREE  QUORUM  ALL
RF 3: OK   OK   x    x      OK      OK

Test 1:
Everything was fine because all nodes were up and the RF does not
exceed the total number of nodes, in which case writes would be
blocked.

Test 2:
CL=TWO did not work because we were unlucky and the wrong node,
responsible for the key range we were trying to insert, was DOWN (I
can accept that for now, however I do not quite understand why this
isn't sorted out by hinted handoff). My explanation might be wrong
again, but CL=THREE should fail because we only have set RF=2, so
there isn't a 3rd replica anywhere anyway. Why did CL=QUORUM not fail
then? Since QUORUM=(RF/2)+1=2 in this case, the write operation should
try to write to 2 replicas, one of which, the one responsible for that
range as we said, is DOWN. I would expect CL=TWO and CL=QUORUM to have
the same outcome in this case. Why is that not the case? CL=ALL fails
for the same reason as CL=TWO I presume.

Test 3:
I was expecting only CL=ANY and CL=ONE to work in this case. CL=TWO
does not work because, just like with Test 2, the same situation
applies with the node responsible for that particular key range being
DOWN. If that's the case, why was CL=QUORUM successful??? The only
explanation I can think of at the moment is that QUORUM explicitly
refers to the total number of nodes in the cluster rather than the
number of replicas determined by the RF. CL=THREE seems easy, it fails
because one of the three replicas is DOWN. CL=ALL is confusing as
well. If my understanding is correct and ALL means all replicas, 3 in
this case, then the operation should fail because one replica is DOWN
and I cannot be lucky with the node that is DOWN not being a replica,
because RF=3 means every node should have a copy of the data.

Furthermore, with regards to being unlucky with the wrong node, if
this is actually what is happening, how is it possible to ever have a
node-failure resilient cassandra cluster? My understanding of this
implies that even with 100 nodes, every 1/100 writes would fail until
the node is replaced/repaired.

Thank you very much in advance.

Vasilis

On Wed, Dec 19, 2012 at 4:18 PM, Roland Gude roland.g...@ez.no wrote:

 Hi

 RF 2 means that 2 nodes are responsible for any given row (no matter how
 many nodes are in the cluster)
 For your cluster with three nodes let's just assume the following
 responsibilities

 Node          A      B      C
 Primary keys  0-5    6-10   11-15
 Replica keys  11-15  0-5    6-10

 Assume node 'C' is down
 Writing any key in range 0-5 with consistency TWO is possible (A and B are
 up)
 Writing any key in range 11-15 with consistency TWO will fail (C is down
 and 11-15 is its primary range)
 Writing any key in range 6-10 with consistency TWO will fail (C is down
 and it is the replica for this range)

 I hope this explains it.

 -----Original Message-----
 From: Vasileios Vlachos [mailto:vasileiosvlac...@gmail.com]
 Sent: Wednesday, 19 December 2012 17:07
 To: user@cassandra.apache.org
 Subject: Replication Factor and Consistency Level Confusion

 Hello All,

 We have a 3-node cluster and we created a keyspace (say Test_1) with
 Replication Factor set to 3. I know it's not great but we wanted to test
 different behaviors. So, we created a Column Family (say cf_1) and we tried
 writing something with Consistency Level ANY, ONE, TWO, THREE, QUORUM and
 ALL. We did that while all nodes were in UP state, so we had no problems at
 all. No matter what the Consistency Level was, we were able to insert a
 value.

 Same cluster, different keyspace (say Test_2) with Replication Factor set
 to 2 this time and one of the 3 nodes deliberately DOWN. Again, we created a
 Column Family (say cf_1) and we tried writing something with different
 Consistency Levels. Here is what we got:
 ANY: worked (expected...)
 ONE: worked (expected...)
 TWO: did not work (WHAT???)
 THREE: did not work (expected...)
 QUORUM: worked (expected...)
 ALL: did not work (expected I guess...)

 Now, we know that QUORUM derives from (RF/2)+1, so we were expecting that
 to work, after all only 1 node was DOWN. Why did Consistency Level TWO
 not work then???

Replication Factor and Consistency Level Confusion

2012-12-19 Thread Vasileios Vlachos
Hello All,

We have a 3-node cluster and we created a keyspace (say Test_1) with
Replication Factor set to 3. I know it's not great but we wanted to test
different behaviors. So, we created a Column Family (say cf_1) and we
tried writing something with Consistency Level ANY, ONE, TWO, THREE,
QUORUM and ALL. We did that while all nodes were in UP state, so we
had no problems at all. No matter what the Consistency Level was, we
were able to insert a value.

Same cluster, different keyspace (say Test_2) with Replication Factor
set to 2 this time and one of the 3 nodes deliberately DOWN. Again, we
created a Column Family (say cf_1) and we tried writing something with
different Consistency Levels. Here is what we got:
ANY: worked (expected...)
ONE: worked (expected...)
TWO: did not work (WHAT???)
THREE: did not work (expected...)
QUORUM: worked (expected...)
ALL: did not work (expected I guess...)

Now, we know that QUORUM derives from (RF/2)+1, so we were expecting
that to work, after all only 1 node was DOWN. Why did Consistency
Level TWO not work then???

Third test... Same cluster again, different keyspace (say Test_3) with
Replication Factor set to 3 this time and 1 of the 3 nodes
deliberately DOWN again. Same approach again, created different Column
Family (say cf_1) and different Consistency Level settings resulted in
the following:
ANY: worked (what???)
ONE: worked (what???)
TWO: did not work (what???)
THREE: did not work (expected...)
QUORUM: worked (what???)
ALL: worked (what???)

We thought that if the Replication Factor is greater than the number
of nodes in the cluster, writes are blocked.

Apparently we are completely missing a level of understanding
here, so we would appreciate any help!

Thank you in advance!

Vasilis


Re: Using Cassandra to store binary files?

2012-10-19 Thread Vasileios Vlachos
Hello,

Thank you all for your responses.

Performance is not an issue at all as I described, so it shouldn't be
problematic. At least this is our current understanding. We will try it and
post back if something interesting comes up. Many thanks.

Regards,

Vasilis



On Tue, Oct 16, 2012 at 7:34 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

 I am not sure.  If I were to implement it myself though, I would have
 probably...

 postfixed the rows with 1,2,3,4,...lastValue and then stored the lastValue
 in the first row so then my program knows all the rows.

 Ie. Not sure an index is really needed in that case.

 Dean

 On 10/16/12 11:45 AM, Michael Kjellman mkjell...@barracuda.com wrote:

 Ah, so they just wrote chunking into Astyanax? Do they create an index
 somewhere so they know how to reassemble the file on the way out?
 
 On 10/16/12 10:36 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
 
 Yes, astyanax stores the file in many rows so it reads from many disks
 giving you a performance advantage vs. storing each file in one row... well
 at least from my understanding so read performance should be really
 really good in that case.
 
 Dean
 
 From: Michael Kjellman mkjell...@barracuda.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, October 16, 2012 10:07 AM
 To: user@cassandra.apache.org
 Subject: Re: Using Cassandra to store binary files?
 
 When we started with Cassandra almost 2 years ago in production,
 originally it was for the sole purpose of storing blobs in a redundant way.
 I ignored the warnings as my own tests showed it would be okay (and two
 years later it is ok). If you plan on using Cassandra later (as we do now,
 as features such as secondary indexes and cql have matured, I'm now
 stuck with a large amount of data in Cassandra that maybe could be in a
 better place.) Does it work? Yes. Would I do it again? Not 100% sure.
 Compactions of these column families take forever.
 
 Also, by default there is a 16MB limit. Yes, this is adjustable but
 currently Thrift does not stream data. I didn't know that Netflix had
 worked around this (referring to Dean's reply) -- I'll have to look
 through the source to see how they are overcoming the limitations of the
 protocol. Last I read there were no plans to make Thrift stream. Looks
 like there is a bug at
 https://issues.apache.org/jira/browse/CASSANDRA-265
 
 You might want to take a look at the following page:
 http://wiki.apache.org/cassandra/CassandraLimitations
 
 I wanted an easy key value store when I originally picked Cassandra. As
 our project needs changed and Cassandra has now begun playing a more
 critical role as it has matured (since the 0.7 days), in retrospect HDFS
 might have been a better option long term as I really will never need
 indexing etc on my binary blobs and the convenience of simply being able
 to grab/reassemble a file by grabbing its key was convenient at the time
 but maybe not the most forward thinking. Hope that helps a bit.
 
 Also, your read performance won't be amazing by any means with blobs. Not
 sure if your priority is reads or writes. In our case it was writes so it
 wasn't a large loss.
 
 Best,
 michael
 
 
 From: Vasileios Vlachos vasileiosvlac...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, October 16, 2012 8:49 AM
 To: user@cassandra.apache.org
 Subject: Using Cassandra to store binary files?
 
 Hello All,
 
 We need to store about 40G of binary files in a redundant way and since
 we are already using Cassandra for other applications we were thinking
 that we could just solve that problem using the same Cassandra cluster.
 Each individual File will be approximately 1MB.
 
 We are thinking that the data structure should be very simple for this
 case, using one CF with just one column which will contain the actual
 files. The row key should then uniquely identify each file. Speed is not
 an issue when we are retrieving the files. Impacting other applications using
 Cassandra is more important for us. In order to prevent performance
 issues with other applications using our Cassandra cluster at the moment,
 we think we should disable key_cache and row_cache for this column
 family.
 
 Anyone tried this before or anyone thinks this is going to be a bad idea?
 Do you think our current plan is sensible? Any input would be much
 appreciated. Thank you in advance.
 
 Regards,
 
 Vasilis
 

Using Cassandra to store binary files?

2012-10-16 Thread Vasileios Vlachos
Hello All,

We need to store about 40G of binary files in a redundant way and since we
are already using Cassandra for other applications we were thinking that we
could just solve that problem using the same Cassandra cluster. Each
individual File will be approximately 1MB.

We are thinking that the data structure should be very simple for this
case: one CF with just one column, which will contain the actual files. The
row key should then uniquely identify each file. Speed is not an issue when
retrieving the files; not impacting the other applications that use
Cassandra is more important to us. To avoid performance issues for the
other applications currently using our Cassandra cluster, we think we
should disable key_cache and row_cache for this column family.

Has anyone tried this before, or does anyone think this is going to be a bad
idea? Do you think our current plan is sensible? Any input would be much
appreciated. Thank you in advance.

Regards,

Vasilis
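
P.S. A minimal sketch of the single-CF design described above, assuming the
pycassa Thrift client; the keyspace, column family and host names are
placeholders:

import pycassa

pool = pycassa.ConnectionPool('FileStore', ['192.0.2.10:9160'])
files = pycassa.ColumnFamily(pool, 'Files')  # one CF, a single 'data' column per row

def store_file(file_id, blob):
    # ~1MB values sit comfortably under the default 16MB Thrift frame limit.
    files.insert(file_id, {'data': blob})

def fetch_file(file_id):
    # The row key uniquely identifies the file; read back the single column.
    return files.get(file_id, columns=['data'])['data']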


Re: Thrift version and OOM errors

2012-07-09 Thread Vasileios Vlachos
Hello,

Thanks for the help. There was a problem in the code, actually: the
connection object was not thread-safe. That is why the messages appeared so
big.

After fixing that we do not get any errors. The cluster seems stable.

Thanks again for all the help.

Regards,

Vasilis
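
P.S. In case it helps anyone who finds this thread later, one way to avoid
sharing a single Thrift connection across threads is to keep one connection
per thread, roughly like this (the Cassandra.Client import assumes
Thrift-generated bindings, and the host/port are placeholders):

import threading
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra  # Thrift-generated bindings; module name depends on codegen

_local = threading.local()

def get_client(host='192.0.2.10', port=9160):
    # Each thread lazily opens its own framed transport and client; two threads
    # interleaving writes on one shared socket is what produces the bogus message lengths.
    if not hasattr(_local, 'client'):
        transport = TTransport.TFramedTransport(TSocket.TSocket(host, port))
        transport.open()
        _local.client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    return _local.client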



On Thu, Jul 5, 2012 at 11:32 PM, aaron morton aa...@thelastpickle.comwrote:

 agree.

 It's a good idea to remove as many variables as possible and get to a
 stable/known state. Use a clean install and a well-known client and see if
 the problems persist.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 5/07/2012, at 4:58 PM, Tristan Seligmann wrote:

 On Jul 4, 2012 2:02 PM, Vasileios Vlachos vasileiosvlac...@gmail.com
 wrote:
 
  Any ideas what could be causing strange message lengths?

 One cause of this that I've seen is a client using unframed Thrift
 transport while the server expects framed, or vice versa. I suppose a
 similar cause could be something that is not a Thrift client at all
 mistakenly connecting to Cassandra's Thrift port.
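
 To make the framed/unframed mismatch concrete, here is what the two
 client-side transports look like with Thrift's Python bindings (host and
 port are placeholders); wrapping the socket the wrong way round for the
 server's setting is enough to produce nonsense message lengths:

 from thrift.transport import TSocket, TTransport

 sock = TSocket.TSocket('192.0.2.10', 9160)
 framed = TTransport.TFramedTransport(sock)      # for a server expecting framed transport
 unframed = TTransport.TBufferedTransport(sock)  # mismatch: the server misreads the length prefix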





Re: Thrift version and OOM errors

2012-07-04 Thread Vasileios Vlachos
Hello Aaron, thanks for your email.

- That's pretty small, try m1.xlarge.

Yes, this is small. We are aware of that, but it doesn't seem to be the
actual problem, and we cannot see any reason why this shouldn't work as a
test environment. Once we have a fair understanding we are going to invest
in proper hardware.

- 1.0.7 ships with thrift  0.6
- What client are you using ? If you have rolled your own client try
using one of
- the pre-built ones to rule out errors in your code.

So, we are now using the right thrift version I guess, unless there are
significant changes between 0.6.1 and 0.6. But if that's the case, why are
we still getting 'old-client' errors???

At the moment we use thrift directly. We might start developing our own
client using C#.

- mmm 1.83 GB message size. Something is not right there.

Do you have any ideas what could be causing that? We are definitely
not trying to store such a large message.

- 208 MB message size which is too big (max is 16MB) followed by out of memory.

We cannot figure out why messages appear to be so large. We are aware of
the 16MB limit and we are not even close to that limit. What could be
causing such a large message size?

- Do you get these errors with a stock 1.0.X install and a pre-built client ?

We have not tested it with a higher level client yet. Do you think we
should not be using thrift alone? Could that be what causes all these
errors?

Thanks in advance for your help,

Regards,

Vasilis



On Wed, Jul 4, 2012 at 11:54 AM, aaron morton aa...@thelastpickle.comwrote:

 We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1 Core),

 That's pretty small, try m1.xlarge.

 We are still not sure what version of thrift to use with Cassandra 1.0.7
 (we are still getting the same message regarding the 'old client').

 1.0.7 ships with thrift  0.6
 What client are you using ? If you have rolled your own client try using
 one of the pre-built ones to rule out errors in your code.

 org.apache.thrift.TException: Message length exceeded: 1970238464

 mmm 1.83 GB message size. Something is not right there.


 org.apache.thrift.TException: Message length exceeded: 218104076

 208 MB message size which is too big (max is 16MB) followed by out of
 memory.

 Do you get these errors with a stock 1.0.X install and a pre-built client ?

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 3/07/2012, at 9:57 AM, Vasileios Vlachos wrote:

 Hello All,

 We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1 Core),
 running Ubuntu 12.04. We have three nodes in the cluster and we hit only
 one node from our application. Thrift version is 0.6.1 (we changed from 0.8
 because we thought there was a compatibility problem between thrift and
 Cassandra ('old client' according to the output.log). We are still not sure
 what version of thrift to use with Cassandra 1.0.7 (we are still getting
 the same message regarding the 'old client'). I would appreciate any help
 on that please.

 Below, I am sharing the errors we are getting from the output.log file.
 First three errors are not responsible for the crash, only the OOM error
 is, but something seems to be really wrong there...

 Error #1

 ERROR 14:00:12,057 Thrift error occurred during processing of message.
 org.apache.thrift.TException: Message length exceeded: 1970238464
 at
 org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:102)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:121)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
 at org.apache.cassandra.thrift.Mutation.read(Mutation.java:355)
 at
 org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18966)
 at
 org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 Error #2

 ERROR 14:03:48,004 Error occurred during processing of message.
 java.lang.StringIndexOutOfBoundsException: String index out of range: -
 2147418111
 at java.lang.String.checkBounds(String.java:397)
 at java.lang.String.init(String.java:442)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readString

Re: Thrift version and OOM errors

2012-07-04 Thread Vasileios Vlachos
We also get negative message lengths occasionally... Please see below:

ERROR 12:49:00,777 Thrift error occurred during processing of message.
org.apache.thrift.TException: Negative length: -2147483634
at
org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:388)
at
org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
at org.apache.cassandra.thrift.Column.read(Column.java:528)
at
org.apache.cassandra.thrift.ColumnOrSuperColumn.read(ColumnOrSuperColumn.java:507)
at org.apache.cassandra.thrift.Mutation.read(Mutation.java:353)
at
org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18966)
at
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

-

Any ideas what could be causing strange message lengths?

Thanks,

Vasilis




On Wed, Jul 4, 2012 at 12:55 PM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:

 Hello Aaron, thanks for your email.

 - That's pretty small, try m1.xlarge.

 Yes, this is small. We are aware of that, but it doesn't seem to be the
 actual problem, and we cannot see any reason why this shouldn't work as a
 test environment. Once we have a fair understanding we are going to invest
 in proper hardware.

 - 1.0.7 ships with thrift  0.6
 - What client are you using ? If you have rolled your own client try using 
 one of
 - the pre-built ones to rule out errors in your code.

 So, we are now using the right thrift version I guess, unless there are
 significant changes between 0.6.1 and 0.6. But if that's the case, why are
 we still getting 'old-client' errors???

 At the moment we use thrift directly. We might start developing our own
 client using C#.

 - mmm 1.83 GB message size. Something is not right there.

 Do you have any ideas what could be causing that? We are definitely not 
 trying to store such a large message.

 - 208 MB message size which is too big (max is 16MB) followed by out of 
 memory.

 We cannot figure out why messages appear to be so large. We are aware of
 the 16MB limit and we are not even close to that limit. What could be
 causing such a large message size?

 - Do you get these errors with a stock 1.0.X install and a pre-built client ?

 We have not tested it with a higher level client yet. Do you think we
 should not be using thrift alone? Could that be what causes all these
 errors?

 Thanks in advance for your help,

 Regards,

 Vasilis



 On Wed, Jul 4, 2012 at 11:54 AM, aaron morton aa...@thelastpickle.comwrote:

 We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1 Core),

 That's pretty small, try m1.xlarge.

 We are still not sure what version of thrift to use with Cassandra 1.0.7
 (we are still getting the same message regarding the 'old client').

 1.0.7 ships with thrift  0.6
 What client are you using ? If you have rolled your own client try using
 one of the pre-built ones to rule out errors in your code.

 org.apache.thrift.TException: Message length exceeded: 1970238464

 mmm 1.83 GB message size. Something is not right there.


 org.apache.thrift.TException: Message length exceeded: 218104076

 208 MB message size which is too big (max is 16MB) followed by out of
 memory.

 Do you get these errors with a stock 1.0.X install and a pre-built client
 ?

 Cheers


   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 3/07/2012, at 9:57 AM, Vasileios Vlachos wrote:

 Hello All,

 We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1
 Core), running Ubuntu 12.04. We have three nodes in the cluster and we hit
 only one node from our application. Thrift version is 0.6.1 (we changed
 from 0.8 because we thought there was a compatibility problem between
 Thrift and Cassandra; 'old client' according to the output.log). We are
 still not sure what version of thrift to use with Cassandra 1.0.7 (we are
 still getting the same message regarding the 'old client'). I would
 appreciate any help on that please.

 Below, I am sharing the errors we are getting from the output.log file.
 First three errors are not responsible for the crash, only the OOM error
 is, but something seems to be really wrong there...

 Error #1

 ERROR 14:00:12,057 Thrift error occurred during processing of message

Re: Thrift version and OOM errors

2012-07-03 Thread Vasileios Vlachos
Just an update to correct something...

The application hits 10.128.16.111. The last lines of Error #4 suggest
that 10.128.16.110 and 10.128.16.112 were down because the Cassandra service
was down on 10.128.16.111 and it could not detect the cluster (I think it
must be gossip related, right???).

Thanks,

Vasilis


On Mon, Jul 2, 2012 at 10:57 PM, Vasileios Vlachos 
vasileiosvlac...@gmail.com wrote:

 Hello All,

 We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1 Core),
 running Ubuntu 12.04. We have three nodes in the cluster and we hit only
 one node from our application. Thrift version is 0.6.1 (we changed from 0.8
 because we thought there was a compatibility problem between Thrift and
 Cassandra; 'old client' according to the output.log). We are still not sure
 what version of thrift to use with Cassandra 1.0.7 (we are still getting
 the same message regarding the 'old client'). I would appreciate any help
 on that please.

 Below, I am sharing the errors we are getting from the output.log file.
 First three errors are not responsible for the crash, only the OOM error
 is, but something seems to be really wrong there...

 Error #1

 ERROR 14:00:12,057 Thrift error occurred during processing of message.
 org.apache.thrift.TException: Message length exceeded: 1970238464
 at
 org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:102)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:121)
 at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
 at org.apache.cassandra.thrift.Mutation.read(Mutation.java:355)
 at
 org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18966)
 at
 org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 Error #2

 ERROR 14:03:48,004 Error occurred during processing of message.
 java.lang.StringIndexOutOfBoundsException: String index out of range: -
 2147418111
 at java.lang.String.checkBounds(String.java:397)
 at java.lang.String.init(String.java:442)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:210)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 Error #3

 ERROR 14:07:24,415 Thrift error occurred during processing of message.
 org.apache.thrift.protocol.TProtocolException: Missing version in
 readMessageBegin, old client?
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:213)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)

 Error #4

 ERROR 16:07:10,168 Thrift error occurred during processing of message.
 org.apache.thrift.TException: Message length exceeded: 218104076
 at
 org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:352)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:347)
 at
 org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
 at
 org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
 at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
 at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886

Thrift version and OOM errors

2012-07-02 Thread Vasileios Vlachos
Hello All,

We are using Cassandra 1.0.7 on AWS on mediums (that is 3.8G RAM, 1 Core),
running Ubuntu 12.04. We have three nodes in the cluster and we hit only
one node from our application. Thrift version is 0.6.1 (we changed from 0.8
because we thought there was a compatibility problem between Thrift and
Cassandra; 'old client' according to the output.log). We are still not sure
what version of thrift to use with Cassandra 1.0.7 (we are still getting
the same message regarding the 'old client'). I would appreciate any help
on that please.

Below, I am sharing the errors we are getting from the output.log file.
First three errors are not responsible for the crash, only the OOM error
is, but something seems to be really wrong there...

Error #1

ERROR 14:00:12,057 Thrift error occurred during processing of message.
org.apache.thrift.TException: Message length exceeded: 1970238464
at
org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
at
org.apache.thrift.protocol.TBinaryProtocol.readBinary(TBinaryProtocol.java:363)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:102)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:112)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:121)
at org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:60)
at org.apache.cassandra.thrift.Mutation.read(Mutation.java:355)
at
org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18966)
at
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Error #2

ERROR 14:03:48,004 Error occurred during processing of message.
java.lang.StringIndexOutOfBoundsException: String index out of range:
-2147418111
at java.lang.String.checkBounds(String.java:397)
at java.lang.String.init(String.java:442)
at
org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:339)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:210)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Error #3

ERROR 14:07:24,415 Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Missing version in
readMessageBegin, old client?
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:213)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Error #4

ERROR 16:07:10,168 Thrift error occurred during processing of message.
org.apache.thrift.TException: Message length exceeded: 218104076
at
org.apache.thrift.protocol.TBinaryProtocol.checkReadLength(TBinaryProtocol.java:393)
at
org.apache.thrift.protocol.TBinaryProtocol.readStringBody(TBinaryProtocol.java:352)
at
org.apache.thrift.protocol.TBinaryProtocol.readString(TBinaryProtocol.java:347)
at
org.apache.cassandra.thrift.Cassandra$batch_mutate_args.read(Cassandra.java:18958)
at
org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3441)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/lib/cassandra/java_1341224307.hprof ...
INFO 16:07:18,882 GC for Copy: 886 ms for 1 collections, 2242700896 used;
max is 2670985216
Java HotSpot(TM) 64-Bit Server VM warning: record is too large
Heap dump file created [4429997807 bytes in 95.755 secs]
INFO 16:08:54,749 GC for ConcurrentMarkSweep: 1157 ms for 4 collections,

Re: Cassandra running out of memory?

2012-04-15 Thread Vasileios Vlachos

Thank you Aaron. 8G memory is about the spec we use now for testing.

I observed a couple of other things when I checked the output.log file, but
I think this should go to another post.


Thank you very much for your advice.

Bill


On 13/04/12 02:49, aaron morton wrote:

It depends on a lot of things: schema size, caches, work load etc.

If you are just starting out I would recommend using a machine with
8GB or 16GB of total RAM. By default Cassandra will take about 4GB or 8GB
(respectively) for the JVM.


Once you have a feel for how things work you should be able to 
estimate the resources your application will need.


Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 13/04/2012, at 2:19 AM, Vasileios Vlachos wrote:


Hello Aaron,

Thank you for getting back to me.

I will change to m1.large first to see how long it will take the
Cassandra node to die (if at all). If I'm still not happy I will try more
memory. I just want to test it step by step and see what the
differences are. I will also change the cassandra-env file back to
defaults.


Is there an absolute minimum requirement for Cassandra in terms of 
memory? I might be wrong, but from my understanding we shouldn't have 
any problems given the amount of data we store per day (currently 
approximately 2-2.5G / day).


Thank you in advance,

Bill


On Wed, Apr 11, 2012 at 7:33 PM, aaron morton 
aa...@thelastpickle.com mailto:aa...@thelastpickle.com wrote:



'system_memory_in_mb' (3760) and the 'system_cpu_cores' (1)
according to our nodes' specification. We also changed the
'MAX_HEAP_SIZE' to 2G and the 'HEAP_NEWSIZE' to 200M (we think
the second is related to the Garbage Collection).

It's best to leave the default settings unless you know what you
are doing here.


In case you find this useful, swap is off and unevictable memory
seems to be very high on all 3 servers (2.3GB; we usually observe
around 0-16KB of unevictable memory on other Linux servers)

Cassandra locks the java memory so it cannot be swapped out.


The problem is that the node we hit from our thrift interface
dies regularly (approximately after we store 2-2.5G of data).
Error message: OutOfMemoryError: Java Heap Space and according
to the log it in fact used all of the allocated memory.

The easiest solution will be to use a larger EC2 instance.

People normally use an m1.xlarge with 16GB of RAM (you could also
try an m1.large).

If you are still experimenting I would suggest using the larger
instances so you can make some progress. Once you have a feel for
how things work you can then try to match the instances to your
budget.

Hope that helps.

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com http://www.thelastpickle.com/

On 11/04/2012, at 1:54 AM, Vasileios Vlachos wrote:


Hello,

We are experimenting a bit with Cassandra lately (version 1.0.7)
and we seem to have some problems with memory. We use EC2 as our
test environment and we have three nodes with 3.7G of memory and
1 core @ 2.4G, all running Ubuntu server 11.10.

The problem is that the node we hit from our thrift interface
dies regularly (approximately after we store 2-2.5G of data).
Error message: OutOfMemoryError: Java Heap Space and according
to the log it in fact used all of the allocated memory.

The nodes are under relatively constant load and store about
2000-4000 row keys a minute, which are batched through the Thrift
interface in 10-30 row keys at once (with about 50 columns
each). The number of reads is very low, around 1000-2000 a
day, each requesting the data of a single row key. There is
currently only one column family in use.

The initial thought was that something was wrong in the
cassandra-env.sh file. So, we specified the variables
'system_memory_in_mb' (3760) and the 'system_cpu_cores' (1)
according to our nodes' specification. We also changed the
'MAX_HEAP_SIZE' to 2G and the 'HEAP_NEWSIZE' to 200M (we think
the second is related to the Garbage Collection). Unfortunately,
that did not solve the issue and the node we hit via thrift
keeps on dying regularly.

In case you find this useful, swap is off and unevictable memory
seems to be very high on all 3 servers (2.3GB; we usually observe
around 0-16KB of unevictable memory on other Linux servers). We are
not quite sure how the unevictable memory ties into Cassandra, it's
just something we observed while looking into the problem. The CPU
is pretty much idle the
entire time. The heap memory is clearly being reduced once in a
while according to nodetool, but obviously grows over the limit
as time goes by.

Any ideas? Thanks in advance.

Bill








--

Kind regards

Re: Cassandra running out of memory?

2012-04-12 Thread Vasileios Vlachos
Hello Aaron,

Thank you for getting back to me.

I will change to m1.large first to see how long it will take the Cassandra
node to die (if at all). If I'm still not happy I will try more memory. I
just want to test it step by step and see what the differences are. I will
also change the cassandra-env file back to defaults.

Is there an absolute minimum requirement for Cassandra in terms of memory?
I might be wrong, but from my understanding we shouldn't have any problems
given the amount of data we store per day (currently approximately 2-2.5G /
day).

Thank you in advance,

Bill


On Wed, Apr 11, 2012 at 7:33 PM, aaron morton aa...@thelastpickle.comwrote:

 'system_memory_in_mb' (3760) and the 'system_cpu_cores' (1) according to
 our nodes' specification. We also changed the 'MAX_HEAP_SIZE' to 2G and the
 'HEAP_NEWSIZE' to 200M (we think the second is related to the Garbage
 Collection).

 It's best to leave the default settings unless you know what you are doing
 here.

 In case you find this useful, swap is off and unevictable memory seems to
 be very high on all 3 servers (2.3GB; we usually observe around 0-16KB of
 unevictable memory on other Linux servers)

 Cassandra locks the java memory so it cannot be swapped out.

 The problem is that the node we hit from our thrift interface dies
 regularly (approximately after we store 2-2.5G of data). Error message:
 OutOfMemoryError: Java Heap Space and according to the log it in fact used
 all of the allocated memory.

 The easiest solution will be to use a larger EC2 instance.

 People normally use an m1.xlarge with 16GB of RAM (you could also try an
 m1.large).

 If you are still experimenting I would suggest using the larger instances
 so you can make some progress. Once you have a feel for how things work you
 can then try to match the instances to your budget.

 Hope that helps.

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 11/04/2012, at 1:54 AM, Vasileios Vlachos wrote:

 Hello,

 We are experimenting a bit with Cassandra lately (version 1.0.7) and we
 seem to have some problems with memory. We use EC2 as our test environment
 and we have three nodes with 3.7G of memory and 1 core @ 2.4G, all running
 Ubuntu server 11.10.

 The problem is that the node we hit from our thrift interface dies
 regularly (approximately after we store 2-2.5G of data). Error message:
 OutOfMemoryError: Java Heap Space and according to the log it in fact used
 all of the allocated memory.

 The nodes are under relatively constant load and store about 2000-4000 row
 keys a minute, which are batched through the Thrift interface in 10-30 row
 keys at once (with about 50 columns each). The number of reads is very low,
 around 1000-2000 a day, each requesting the data of a single row key. There
 is currently only one column family in use.
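
 For reference, the kind of batching we mean, sketched here with pycassa for
 brevity (we actually call batch_mutate through Thrift directly; the
 keyspace, CF and host names below are placeholders):

 import pycassa

 pool = pycassa.ConnectionPool('OurKeyspace', ['192.0.2.10:9160'])
 cf = pycassa.ColumnFamily(pool, 'Events')

 def write_batch(rows):
     # rows: dict of row_key -> dict of ~50 columns, sent 10-30 row keys at a time.
     with cf.batch(queue_size=30) as b:
         for key, columns in rows.items():
             b.insert(key, columns)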

 The initial thought was that something was wrong in the cassandra-env.sh
 file. So, we specified the variables 'system_memory_in_mb' (3760) and the
 'system_cpu_cores' (1) according to our nodes' specification. We also
 changed the 'MAX_HEAP_SIZE' to 2G and the 'HEAP_NEWSIZE' to 200M (we think
 the second is related to the Garbage Collection). Unfortunately, that did
 not solve the issue and the node we hit via thrift keeps on dying regularly.

 In case you find this useful, swap is off and unevictable memory seems to
 be very high on all 3 servers (2.3GB; we usually observe around 0-16KB of
 unevictable memory on other Linux servers). We are not quite sure how the
 unevictable memory ties into Cassandra, it's just something we observed
 while looking into the problem. The CPU is pretty
 much idle the entire time. The heap memory is clearly being reduced once in
 a while according to nodetool, but obviously grows over the limit as time
 goes by.

 Any ideas? Thanks in advance.

 Bill