Write timeout on other nodes when joining a new node (in new DC)

2015-10-20 Thread Jiri Horky
Hi all,

we are experiencing a strange behavior when we are trying to bootstrap a
new node. The problem is that the Recent Write Latency goes to 2s on all
the other Cassandra nodes (which are receiving user traffic), which
corresponds to our setting of "write_request_timeout_in_ms: 2000".

We use Cassandra 2.0.10 and are trying to convert to vnodes and increase the
replication factor. So we are adding a new node in a new DC (marked as
DCXA) as the only node in that DC, with replication factor 3. The reason
for the higher RF is that we will be converting another 2 existing servers
to the new DC (vnodes) and we want them to get all the data.

The replication settings look like this:
ALTER KEYSPACE slw WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC4': '1',
  'DC5': '1',
  'DC2': '1',
  'DC3': '1',
  'DC0': '1',
  'DC1': '1',
  'DC0A': '3',
  'DC1A': '3',
  'DC2A': '3',
  'DC3A': '3',
  'DC4A': '3',
  'DC5A': '3'
};

We added the nodes to DC0A-DC4A without any effect on the existing
nodes (DCX without A). When we try to add DC5A, the above-mentioned
problem happens, 100% reproducibly.

I tried to increase the number of concurrent_writes from 32 to 128 on the
old nodes, and also tried to increase the number of flush writers, both with
no effect. The strange thing is that the load, CPU usage, GC, network
throughput - everything is fine on the old nodes which are reporting 2s
write latency. Nodetool tpstats does not show any blocked/pending
operations.

I think I must be hitting some limit (because of the overall number of
replicas?) somewhere.

Any input would be greatly appreciated.

Thanks
Jirka H.



Re: Write timeout on other nodes when joining a new node (in new DC)

2015-10-20 Thread Jiri Horky
Hi all,

so after a deep investigation, we found out that we are hitting this problem:

https://issues.apache.org/jira/browse/CASSANDRA-8058

Jiri Horky

On 10/20/2015 12:00 PM, Jiri Horky wrote:
> Hi all,
>
> we are experiencing a strange behavior when we are trying to bootstrap a
> new node. The problem is that the Recent Write Latency goes to 2s on all
> the other Cassandra nodes (which are receiving user traffic), which
> corresponds to our setting of "write_request_timeout_in_ms: 2000".
>
> We use Cassandra 2.0.10 and are trying to convert to vnodes and increase the
> replication factor. So we are adding a new node in a new DC (marked as
> DCXA) as the only node in that DC, with replication factor 3. The reason
> for the higher RF is that we will be converting another 2 existing servers
> to the new DC (vnodes) and we want them to get all the data.
>
> The replication settings look like this:
> ALTER KEYSPACE slw WITH replication = {
>   'class': 'NetworkTopologyStrategy',
>   'DC4': '1',
>   'DC5': '1',
>   'DC2': '1',
>   'DC3': '1',
>   'DC0': '1',
>   'DC1': '1',
>   'DC0A': '3',
>   'DC1A': '3',
>   'DC2A': '3',
>   'DC3A': '3',
>   'DC4A': '3',
>   'DC5A': '3'
> };
>
> We added the nodes to DC0A-DC4A without any effect on the existing
> nodes (DCX without A). When we try to add DC5A, the above-mentioned
> problem happens, 100% reproducibly.
>
> I tried to increase the number of concurrent_writes from 32 to 128 on the
> old nodes, and also tried to increase the number of flush writers, both with
> no effect. The strange thing is that the load, CPU usage, GC, network
> throughput - everything is fine on the old nodes which are reporting 2s
> write latency. Nodetool tpstats does not show any blocked/pending
> operations.
>
> I think I must be hitting some limit (because of the overall number of
> replicas?) somewhere.
>
> Any input would be greatly appreciated.
>
> Thanks
> Jirka H.
>



Re: Availability testing of Cassandra nodes

2015-04-09 Thread Jiri Horky
Hi Jack,

it seems there is some misunderstanding. There are two things. One is
that Cassandra works for the application, which may (and should) be true
even if some of the nodes are actually down. The other is that
even in this case you want to be notified that there are faulty
Cassandra nodes.

Now I am trying to tackle the latter case; I am not having issues with
how client-side load balancing works.

Jirka H.

On 04/09/2015 07:15 AM, Ajay wrote:
 Adding Java driver forum.

 Even we like to know more on this.

 -
 Ajay

 On Wed, Apr 8, 2015 at 8:15 PM, Jack Krupansky
 jack.krupan...@gmail.com wrote:

 Just a couple of quick comments:

 1. The driver is supposed to be doing availability and load
 balancing already.
 2. If your cluster is lightly loaded, it isn't necessary to be so
 precise with load balancing.
 3. If your cluster is heavily loaded, it won't help. Solution is
 to expand your cluster so that precise balancing of requests
 (beyond what the driver does) is not required.

 Is there anything special about your use case that you feel is
 worth the extra treatment?

 If you are having problems with the driver balancing requests and
 properly detecting available nodes or see some room for
 improvement, make sure to file the issues so that they can be fixed.


 -- Jack Krupansky

 On Wed, Apr 8, 2015 at 10:31 AM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 we are thinking of how to best proceed with availability testing of
 Cassandra nodes. It is becoming more and more apparent that it is a
 rather complex task. We thought that we should try to read and write
 to each Cassandra node to a monitoring keyspace with a unique value
 with a low TTL. This helps to find an issue, but it also triggers
 flapping of unaffected hosts, as the key of the value which is being
 inserted sometimes belongs to an affected host and sometimes not.
 Now, we could calculate the right value to insert so we can be sure
 it will hit the host we are connecting to, but then, you have the
 replication factor and consistency level, so you cannot be really
 sure that it actually tests the ability of the given host to write
 values.

 So we ended up thinking that the best approach is to connect to each
 individual host, read some system keyspace (which might be on a
 different disk drive...), which should be local, and then check
 several JMX values that could indicate an error + JVM statistics
 (full heap, GC overhead). Moreover, we will monitor more closely our
 applications that are using Cassandra (with mostly the DataStax
 driver) and try to get failed-node information from them.

 How do others do the testing?

 Jirka H.






Availability testing of Cassandra nodes

2015-04-08 Thread Jiri Horky
Hi all,

we are thinking of how to best proceed with availability testing of
Cassandra nodes. It is becoming more and more apparent that it is a
rather complex task. We thought that we should try to read and write to
each Cassandra node to a monitoring keyspace with a unique value with a
low TTL. This helps to find an issue, but it also triggers flapping of
unaffected hosts, as the key of the value which is being inserted
sometimes belongs to an affected host and sometimes not. Now, we could
calculate the right value to insert so we can be sure it will hit the
host we are connecting to, but then, you have the replication factor and
consistency level, so you cannot be really sure that it actually tests
the ability of the given host to write values.

So we ended up thinking that the best approach is to connect to each
individual host, read some system keyspace (which might be on a
different disk drive...), which should be local, and then check several
JMX values that could indicate an error + JVM statistics (full heap, GC
overhead). Moreover, we will monitor more closely our applications that
are using Cassandra (with mostly the DataStax driver) and try to get
failed-node information from them.
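
A minimal sketch of such a per-host probe (the CQL side only, not the JMX checks),
assuming the DataStax Java driver 2.x called from Scala; the helper name, the port
and the exact query are illustrative, not taken from our actual monitoring:

import java.net.InetSocketAddress
import java.util.Collections
import com.datastax.driver.core.Cluster
import com.datastax.driver.core.policies.{RoundRobinPolicy, WhiteListPolicy}

// Probe a single node. The WhiteListPolicy keeps the driver from silently routing
// the request to another coordinator, and system.local is always served locally,
// so a timeout or error here points at this particular node, not at its replicas.
def probe(host: String, port: Int = 9042): Boolean = {
  val address = new InetSocketAddress(host, port)
  val cluster = Cluster.builder()
    .addContactPoint(host)
    .withPort(port)
    .withLoadBalancingPolicy(
      new WhiteListPolicy(new RoundRobinPolicy(), Collections.singletonList(address)))
    .build()
  try {
    cluster.connect().execute("SELECT release_version FROM system.local").one() != null
  } catch {
    case _: Exception => false
  } finally {
    cluster.close()
  }
}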

How do others do the testing?

Jirka H.


Re: High GC activity on node with 4TB on data

2015-02-12 Thread Jiri Horky
Number of cores: 2x 6 cores x2 (HT).

I do agree with you that the hardware is certainly oversized for
just one Cassandra instance, but we got a very good price since we ordered
several tens of the same nodes for a different project. That's why we use
it for multiple Cassandra instances.

Jirka H.

On 02/12/2015 04:18 PM, Eric Stevens wrote:
  each node has 256G of memory, 24x1T drives, 2x Xeon CPU

 I don't have first hand experience running Cassandra on such massive
 hardware, but it strikes me that these machines are dramatically
 oversized to be good candidates for Cassandra (though I wonder how
 many cores are in those CPUs; I'm guessing closer to 18 than 2 based
 on the other hardware).

 A larger cluster of smaller hardware would be a much better shape for
 Cassandra.  Or several clusters of smaller hardware since you're
 running multiple instances on this hardware - best practices have one
 instance per host no matter the hardware size.

 On Thu, Feb 12, 2015 at 12:36 AM, Jiri Horky ho...@avast.com wrote:

 Hi Chris,

 On 02/09/2015 04:22 PM, Chris Lohfink wrote:
  - number of tombstones - how can I reliably find it out?
 https://github.com/spotify/cassandra-opstools
 https://github.com/cloudian/support-tools
 thanks.

 If you're not getting much compression it may be worth trying to disable
 it; it may contribute, but it's very unlikely that it's the cause of
 the GC pressure itself.

 7000 sstables but STCS? Sounds like compactions couldn't keep
 up.  Do you have a lot of pending compactions (nodetool)?  You
 may want to increase your compaction throughput (nodetool) to see
 if you can catch up a little, it would cause a lot of heap
 overhead to do reads with that many.  May even need to take more
 drastic measures if it can't catch back up.
 I am sorry, I was wrong. We actually do use LCS (the switch was
 done recently). There are almost no pending compactions. We have
 increased the sstable size to 768M, so that should help as well.


 May also be good to check `nodetool cfstats` for very wide
 partitions.  
 There are basically none, this is fine.

 It seems that the problem really comes from having so much data in
 so many sstables, so the
 org.apache.cassandra.io.compress.CompressedRandomAccessReader
 classes consume more memory than 0.75*HEAP_SIZE, which triggers
 the CMS over and over.

 We have turned off the compression and so far, the situation seems
 to be fine.

 Cheers
 Jirka H.



 There's a good chance that if you are under load and have over an 8 GB
 heap, your GCs could use tuning.  The bigger the nodes the more manual
 tweaking it will require to get the most out of
 them; https://issues.apache.org/jira/browse/CASSANDRA-8150 also
 has some ideas.

 Chris

 On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 thank you all for the info.

 To answer the questions:
  - we have 2 DCs with 5 nodes in each, each node has 256G of
 memory, 24x1T drives, 2x Xeon CPU - there are multiple
 cassandra instances running for different projects. The node
 itself is powerful enough.
  - there are 2 keyspaces, one with 3 replicas per DC, one with 1
 replica per DC (because of the amount of data and because it
 serves more or less like a cache)
  - there are about 4k/s Request-response, 3k/s Read and 2k/s
 Mutation requests - numbers are the sum over all nodes
  - we use STCS (LCS would be quite IO-heavy for this amount of
 data)
  - number of tombstones - how can I reliably find it out?
  - the biggest CF (3.6T per node) has 7000 sstables

 Now, I understand that the best practice for Cassandra is to
 run with the minimum heap size that is sufficient, which in
 this case we thought is about 12G - there is always 8G
 consumed by the SSTable readers. Also, I thought that a high
 number of tombstones creates pressure in the new space (which
 can then cause pressure in the old space as well), but this is
 not what we are seeing. We see continuous GC activity in the Old
 generation only.

 Also, I noticed that the biggest CF has a compression factor of
 0.99, which basically means that the data already comes
 compressed. Do you think that turning off compression should
 help with memory consumption?

 Also, I think that tuning CMSInitiatingOccupancyFraction=75
 might help here, as it seems that 8G is something that
 Cassandra needs for bookkeeping this amount of data and that
 this was slightly above the 75% limit which triggered the CMS
 again and again.

 I will definitely have a look at the presentation.

 Regards
 Jiri Horky


 On 02/08/2015 10:32

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Well, I always wondered how Cassandra can be used in a Hadoop-like
environment where you basically need to do full table scans.

I have to say that our experience is that Cassandra is perfect for
writing and for reading specific values by key, but definitely not for
reading all of the data out of it. Some of our projects found out that
doing that with a non-trivial amount of data in a timely manner is close
to impossible in many situations. We are slowly moving to storing the data
in HDFS and possibly reprocessing it on a daily basis for such use cases
(statistics).

This is nothing against Cassandra; it cannot be perfect for everything.
But I am really interested in how it can work well with Spark/Hadoop, where
you basically need to read all the data as well (as far as I understand
it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
 The very nature of cassandra's distributed nature vs partitioning
 data on hadoop makes spark on hdfs actually faster than on cassandra

 Prove it. Did you ever have a look into the source code of the
 Spark/Cassandra connector to see how data locality is achieved before
 throwing out such a statement?

 On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON)
 mvallemil...@bloomberg.net wrote:

  cassandra makes a very poor datawarehouse or long term time series store

 Really? This is not the impression I have... I think Cassandra is
 good to store larges amounts of data and historical information,
 it's only not good to store temporary data.
 Netflix has a large amount of data and it's all stored in
 Cassandra, AFAIK.

  The very nature of cassandra's distributed nature vs partitioning data
 on hadoop makes spark on hdfs actually faster than on cassandra.

 I am not sure about the current state of Spark support for
 Cassandra, but I guess if you create a map reduce job, the
 intermediate map results will still be stored in HDFS, as
 happens with Hadoop, is this right? I think the problem with Spark +
 Cassandra or with Hadoop + Cassandra is that the hard part spark
 or hadoop does, the shuffling, could be done out of the box with
 Cassandra, but no one takes advantage of that. What if a map /
 reduce job used a temporary CF in Cassandra to store intermediate
 results?

 From: user@cassandra.apache.org
 Subject: Re: How to speed up SELECT * query in Cassandra

 I use spark with cassandra, and you dont need DSE.

 I see a lot of people ask this same question below (how do I
 get a lot of data out of cassandra?), and my question is
 always, why aren't you updating both places at once?

 For example, we use hadoop and cassandra in conjunction with
 each other, we use a message bus to store every event in both,
 aggregate in both, but only keep current data in cassandra
 (cassandra makes a very poor datawarehouse or long term time
 series store) and then use services to process queries that
 merge data from hadoop and cassandra.

 Also, spark on hdfs gives more flexibility in terms of large
 datasets and performance.  The very nature of cassandra's
 distributed nature vs partitioning data on hadoop makes spark
 on hdfs actually faster than on cassandra



 --
 Colin Clark
 +1 612 859 6129
 Skype colin.p.clark

 On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:


 On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON)
 mvallemil...@bloomberg.net wrote:

 If you use Cassandra enterprise, you can use hive, AFAIK.


 Even better, you can use Spark/Shark with DSE.

 Cheers,
 Jens


 -- 
 Jens Rantil
 Backend engineer
 Tink AB

 Email: jens.ran...@tink.se
 Phone: +46 708 84 18 32
 Web: www.tink.se

 Facebook https://www.facebook.com/#%21/tink.se
 LinkedIn http://www.linkedin.com/company/2735919
 Twitter https://twitter.com/tink






Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Hi,

here are some snippets of code in Scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}


private def nextPageQuery(row: Row, upperLimit: String): String = {
  // everything after the last key seen, up to the range's upper bound
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}


private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: <" + start.underlying.toPlainString + ", "
        + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s",                            // template slot for the token condition
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}


case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}
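
For completeness, a hedged sketch of how these pieces can be driven; it builds on the
definitions above (session, pageSize, calculateRanges, nextPageQuery, initialQuery),
while the explicit while loop, the .par parallelism and the short-page termination
check are illustrative assumptions of mine rather than part of the original code:

// Split the token space into `concurrency` ranges and scan each one independently,
// paging through it with the token() queries built above.
val (_, _, ranges) = calculateRanges

ranges.par.foreach { case (lower, upper) =>
  var lastRow: Option[Row] = None
  var finished = false
  while (!finished) {
    val query = lastRow match {
      case Some(row) => nextPageQuery(row, upper.toBigInt.toString)
      case None      => initialQuery(lower.toBigInt.toString)
    }
    val rows = session.execute(query).all
    // ... process the page here ...
    if (rows.size < pageSize) finished = true        // short page => range exhausted
    else lastRow = Some(rows.get(rows.size - 1))     // continue after the last key seen
  }
}

Each range comes straight out of calculateRanges, so the concurrency value directly
controls how many of these independent scans run at once.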


On 02/11/2015 02:21 PM, Ja Sam wrote:
 Your answer looks very promising

  How do you calculate start and stop?

 On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote:

 The fastest way I am aware of is to do the queries in parallel to
 multiple Cassandra nodes and make sure that you only ask them for keys
 they are responsible for. Otherwise, the node needs to resend your
 query, which is much slower and creates unnecessary objects (and thus GC
 pressure).

 You can manually take advantage of the token range information, if the
 driver does not take this into account for you. Then, you can play with
 the concurrency and batch size of a single query against one node.
 Basically, what you/the driver should do is to transform the query into
 a series of SELECT * FROM table WHERE token(key) > start AND
 token(key) <= stop queries.

 I will need to look up the actual code, but the idea should be
 clear :)

 Jirka H.


 On 02/11/2015 11:26 AM, Ja Sam wrote:
  Is there a simple way (or even a complicated one) how I can speed up
  a SELECT * FROM [table] query?
  I need to get all rows from one table every day. I split the tables, and
  create one for each day, but the query is still quite slow (200 million
  records).

  I was thinking about running this query in parallel, but I don't know if
  it is possible





Re: High GC activity on node with 4TB on data

2015-02-11 Thread Jiri Horky
Hi Chris,

On 02/09/2015 04:22 PM, Chris Lohfink wrote:
  - number of tombstones - how can I reliably find it out?
 https://github.com/spotify/cassandra-opstools
 https://github.com/cloudian/support-tools
thanks.

 If you're not getting much compression it may be worth trying to disable it;
 it may contribute, but it's very unlikely that it's the cause of the GC
 pressure itself.

 7000 sstables but STCS? Sounds like compactions couldn't keep up.  Do
 you have a lot of pending compactions (nodetool)?  You may want to
 increase your compaction throughput (nodetool) to see if you can catch
 up a little, it would cause a lot of heap overhead to do reads with
 that many.  May even need to take more drastic measures if it can't
 catch back up.
I am sorry, I was wrong. We actually do use LCS (the switch was done
recently). There are almost no pending compactions. We have increased
the sstable size to 768M, so that should help as well.
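
For reference, the LCS sstable target size is a per-table compaction option; a hedged
sketch of such a change (the keyspace and table names are placeholders, not from this
thread):

ALTER TABLE myks.mycf WITH compaction =
  {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 768};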


 May also be good to check `nodetool cfstats` for very wide partitions.  
There are basically none, this is fine.

It seems that the problem really comes from having so much data in so
many sstables, so the
org.apache.cassandra.io.compress.CompressedRandomAccessReader classes
consume more memory than 0.75*HEAP_SIZE, which triggers the CMS over
and over.

We have turned off the compression and so far, the situation seems to be
fine.

Cheers
Jirka H.


 There's a good chance that if you are under load and have over an 8 GB heap,
 your GCs could use tuning.  The bigger the nodes the more manual tweaking it
 will require to get the most out of
 them; https://issues.apache.org/jira/browse/CASSANDRA-8150 also has
 some ideas.

 Chris

 On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 thank you all for the info.

 To answer the questions:
  - we have 2 DCs with 5 nodes in each, each node has 256G of
 memory, 24x1T drives, 2x Xeon CPU - there are multiple cassandra
 instances running for different project. The node itself is
 powerful enough.
  - there 2 keyspaces, one with 3 replicas per DC, one with 1
 replica per DC (because of amount of data and because it serves
 more or less like a cache)
  - there are about 4k/s Request-response, 3k/s Read and 2k/s
 Mutation requests  - numbers are sum of all nodes
  - we us STCS (LCS would be quite IO have for this amount of data)
  - number of tombstones - how can I reliably find it out?
  - the biggest CF (3.6T per node) has 7000 sstables

 Now, I understand that the best practice for Cassandra is to run
 with the minimum heap size that is sufficient, which in this
 case we thought is about 12G - there is always 8G consumed by the
 SSTable readers. Also, I thought that a high number of tombstones
 creates pressure in the new space (which can then cause pressure in
 the old space as well), but this is not what we are seeing. We see
 continuous GC activity in the Old generation only.

 Also, I noticed that the biggest CF has a compression factor of 0.99,
 which basically means that the data already comes compressed. Do
 you think that turning off compression would help with memory
 consumption?

 Also, I think that tuning CMSInitiatingOccupancyFraction=75 might
 help here, as it seems that 8G is something that Cassandra needs
 for bookkeeping this amount of data and that this was slightly
 above the 75% limit which triggered the CMS again and again.

 I will definitely have a look at the presentation.

 Regards
 Jiri Horky


 On 02/08/2015 10:32 PM, Mark Reddy wrote:
 Hey Jiri, 

 While I don't have any experience running 4TB nodes (yet), I
 would recommend taking a look at a presentation by Aaron Morton
 on large
 nodes: 
 http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/
 to see if you can glean anything from that.

 I would note that at the start of his talk he mentions that in
 version 1.2 we can now talk about nodes around 1 - 3 TB in size,
 so if you are storing anything more than that you are getting
 into very specialised use cases.

 If you could provide us with some more information about your
 cluster setup (No. of CFs, read/write patterns, do you delete /
 update often, etc.) that may help in getting you to a better place.


 Regards,
 Mark

 On 8 February 2015 at 21:10, Kevin Burton bur...@spinn3r.com wrote:

 Do you have a lot of individual tables?  Or lots of small
 compactions?

 I think the general consensus is that (at least for
 Cassandra), 8GB heaps are ideal.  

 If you have lots of small tables it’s a known anti-pattern (I
 believe) because the Cassandra internals could do a better
 job on handling the in memory metadata representation.

 I think this has

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
The fastest way I am aware of is to do the queries in parallel to
multiple Cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to resend your query,
which is much slower and creates unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
the concurrency and batch size of a single query against one node.
Basically, what you/the driver should do is to transform the query into a
series of SELECT * FROM table WHERE token(key) > start AND token(key) <= stop
queries.

I will need to look up the actual code, but the idea should be clear :)

Jirka H.


On 02/11/2015 11:26 AM, Ja Sam wrote:
 Is there a simple way (or even a complicated one) how I can speed up
 a SELECT * FROM [table] query?
 I need to get all rows from one table every day. I split the tables, and
 create one for each day, but the query is still quite slow (200 million
 records).

 I was thinking about running this query in parallel, but I don't know if
 it is possible



Re: High GC activity on node with 4TB on data

2015-02-09 Thread Jiri Horky
Hi all,

thank you all for the info.

To answer the questions:
 - we have 2 DCs with 5 nodes in each, each node has 256G of memory,
24x1T drives, 2x Xeon CPU - there are multiple cassandra instances
running for different projects. The node itself is powerful enough.
 - there are 2 keyspaces, one with 3 replicas per DC, one with 1 replica per
DC (because of the amount of data and because it serves more or less like a
cache)
 - there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation
requests - numbers are the sum over all nodes
 - we use STCS (LCS would be quite IO-heavy for this amount of data)
 - number of tombstones - how can I reliably find it out?
 - the biggest CF (3.6T per node) has 7000 sstables

Now, I understand that the best practice for Cassandra is to run with
the minimum heap size that is sufficient, which in this case we thought
is about 12G - there is always 8G consumed by the SSTable readers.
Also, I thought that a high number of tombstones creates pressure in the new
space (which can then cause pressure in the old space as well), but this is
not what we are seeing. We see continuous GC activity in the Old generation
only.

Also, I noticed that the biggest CF has a compression factor of 0.99, which
basically means that the data already comes compressed. Do you think that
turning off compression would help with memory consumption?

Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help
here, as it seems that 8G is something that Cassandra needs for
bookkeeping this amount of data and that this was slightly above the 75%
limit, which triggered the CMS again and again.
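
With a 12G heap the CMS starts at 0.75 x 12G = 9G of old generation, so ~8G of
permanently retained reader structures leaves very little headroom before the
collector loops. For reference, a hedged illustration of where this threshold is
tuned (cassandra-env.sh; the values are examples only, not a recommendation from
this thread):

JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=80"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"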

I will definitely have a look at the presentation.

Regards
Jiri Horky

On 02/08/2015 10:32 PM, Mark Reddy wrote:
 Hey Jiri, 

 While I don't have any experience running 4TB nodes (yet), I would
 recommend taking a look at a presentation by Aaron Morton on large
 nodes: 
 http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/
 to see if you can glean anything from that.

 I would note that at the start of his talk he mentions that in version
 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are
 storing anything more than that you are getting into very specialised
 use cases.

 If you could provide us with some more information about your cluster
 setup (No. of CFs, read/write patterns, do you delete / update often,
 etc.) that may help in getting you to a better place.


 Regards,
 Mark

 On 8 February 2015 at 21:10, Kevin Burton bur...@spinn3r.com wrote:

 Do you have a lot of individual tables?  Or lots of small
 compactions?

 I think the general consensus is that (at least for Cassandra),
 8GB heaps are ideal.  

 If you have lots of small tables it’s a known anti-pattern (I
 believe) because the Cassandra internals could do a better job on
 handling the in memory metadata representation.

 I think this has been improved in 2.0 and 2.1 though so the fact
 that you're on 1.2.18 could exacerbate the issue.  You might want
 to consider an upgrade (though that has its own issues as well).

 On Sun, Feb 8, 2015 at 12:44 PM, Jiri Horky ho...@avast.com wrote:

 Hi all,

 we are seeing quite high GC pressure (in the old space, by the CMS GC
 algorithm) on a node with 4TB of data. It runs C* 1.2.18 with 12G of
 heap memory (2G for the new space). The node runs fine for a couple of
 days, when the GC activity starts to rise and reaches about 15% of the
 C* activity, which causes dropped messages and other problems.

 Taking a look at heap dump, there is about 8G used by
 SSTableReader
 classes in
 org.apache.cassandra.io.compress.CompressedRandomAccessReader.

 Is this something expected and have we just reached the limit of how
 much data a single Cassandra instance can handle, or is it possible to
 tune it better?

 Regards
 Jiri Horky




 --
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile:
 https://plus.google.com/102718274791889610666/posts





High GC activity on node with 4TB on data

2015-02-08 Thread Jiri Horky
Hi all,

we are seeing quite high GC pressure (in the old space, by the CMS GC algorithm)
on a node with 4TB of data. It runs C* 1.2.18 with 12G of heap memory
(2G for the new space). The node runs fine for a couple of days, when the GC
activity starts to rise and reaches about 15% of the C* activity, which
causes dropped messages and other problems.

Taking a look at heap dump, there is about 8G used by SSTableReader
classes in org.apache.cassandra.io.compress.CompressedRandomAccessReader.

Is this something expected and have we just reached the limit of how
much data a single Cassandra instance can handle, or is it possible to
tune it better?

Regards
Jiri Horky


Re: Node down during move

2014-12-23 Thread Jiri Horky
Hi,

just a follow-up. We have seen this behavior multiple times now. It seems
that the receiving node loses connectivity to the cluster and thus
thinks that it is the sole online node, whereas the rest of the cluster
thinks that it is the only offline node, right after the streaming
is over. I am not sure what causes that, but it is reproducible. A restart
of the affected node helps.

We have 3 datacenters (RF=1 for each datacenter) where we are moving the
tokens. This happens only in one of them.

Regards
Jiri Horky


On 12/19/2014 08:20 PM, Jiri Horky wrote:
 Hi list,

 we added a new node to an existing 8-node cluster with C* 1.2.9 without
 vnodes, and because we are almost totally out of space, we are shuffling
 the tokens of one node after another (not in parallel). During one of these
 move operations, the receiving node died and thus the streaming failed:

  WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
 StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
  INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
 ColumnFamilyStore.java (line 629) Enqueuing flush of
 Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
  INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
 CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
 /X.Y.Z.18:2,5,RMI Runtime]
 java.lang.RuntimeException: java.io.IOException: Broken pipe
 at com.google.common.base.Throwables.propagate(Throwables.java:160)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: java.io.IOException: Broken pipe
 at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

 After restart of the receiving node, we tried to perform the move again,
 but it failed with:

 Exception in thread main java.io.IOException: target token
 113427455640312821154458202477256070486 is already owned by another node.
 at
 org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

 So we tried to move it with a token just 1 higher, to trigger the
 movement. This didn't move anything, but finished successfully:

  INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
 from /X.Y.Z.18

 Now, it is quite improbable that the first streaming was done and it
 died just after copying everything, as the ERROR was the last message
 about streaming in the logs. Is there any way to make sure the data
 was really moved and thus that running nodetool cleanup is safe?

 Thank you.
 Jiri Horky



Node down during move

2014-12-19 Thread Jiri Horky
Hi list,

we added a new node to an existing 8-node cluster with C* 1.2.9 without
vnodes, and because we are almost totally out of space, we are shuffling
the tokens of one node after another (not in parallel). During one of these
move operations, the receiving node died and thus the streaming failed:

 WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
 INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
ColumnFamilyStore.java (line 629) Enqueuing flush of
Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
/X.Y.Z.18:2,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After restart of the receiving node, we tried to perform the move again,
but it failed with:

Exception in thread main java.io.IOException: target token
113427455640312821154458202477256070486 is already owned by another node.
at
org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it with a token just 1 higher, to trigger the
movement. This didn't move anything, but finished successfully:

 INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
from /X.Y.Z.18

Now, it is quite improbable that the first streaming was done and it
died just after copying everything, as the ERROR was the last message
about streaming in the logs. Is there any way to make sure the data
was really moved and thus that running nodetool cleanup is safe?
   
Thank you.
Jiri Horky


Fail to reconnect to other nodes after intermittent network failure

2014-08-05 Thread Jiri Horky
$Segment.lockedGetOrLoad(LocalCache.java:2337)
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
... 20 more
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.
at
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
at
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
at
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
at
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
... 29 more

These exceptions stopped appearing at 2014-08-01 08:59:17,48 (and did not
reappear after that), and Cassandra seemed to be happy as it marked the
other two nodes as UP for a while. But after a few tens of seconds, it
marked them as DOWN again and kept them in this state until a restart three
days later, even though the network connectivity was stable:

 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.42.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /77.234.42.20
 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.44.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /77.234.44.20
 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
(line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
serialized/live bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
443) Completed flushing; nothing needed to be retained.  Commitlog
position was ReplayPosition(segmentId=1403712545417, position=20482758)
 INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.44.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.42.20; aborting (0 delivered)
 INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
InetAddress /77.234.44.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
InetAddress mia10.ff.avast.com/77.234.42.20 is now DOWN

After one day, the node started to discard the hints it had saved for the
other nodes.

 WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
HintedHandoffMetrics.java (line 78) /77.234.44.20 has 9039 dropped
hints, because node is down past configured hint window.
 WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
HintedHandoffMetrics.java (line 78) /77.234.42.20 has 8816 dropped
hints, because node is down past configured hint window.


The nodetool status output showed the other two nodes as down. After a
restart, the node reconnected to the cluster without any problem.

What puzzles me is the fact that authentication apparently started
to work after the network recovered, but the exchange of data did not.

I would like to understand what could have caused the problems and how to
avoid them in the future.

Any pointers would be appreciated.

Regards
Jiri Horky



Re: Fail to reconnect to other nodes after intermittent network failure

2014-08-05 Thread Jiri Horky
OK, ticket 7696 [1] created.

Jiri Horky

[1] https://issues.apache.org/jira/browse/CASSANDRA-7696

On 08/05/2014 07:57 PM, Robert Coli wrote:

 On Tue, Aug 5, 2014 at 5:48 AM, Jiri Horky ho...@avast.com wrote:

 What puzzles me is the fact that authentication apparently started
 to work after the network recovered but the exchange of data did not.

 I would like to understand what could caused the problems and how to
 avoid  them in the future.


 Very few people use SSL and very few people use auth, you have
 probably hit an edge case.

 I would file a JIRA with the details you described above.

 =Rob
  



Re: upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL

2014-01-11 Thread Jiri Horky
Thank you both for the answers!

Jiri Horky

On 01/10/2014 02:52 AM, Aaron Morton wrote:
 We avoid mixing versions for a long time, but we always upgrade one
 node and check that the application is happy before proceeding, e.g. wait
 for 30 minutes before upgrading the others.

 If you snapshot before upgrading, and have to roll back after 30
 minutes you can roll back to the snapshot and use repair to fix the
 data on disk. 

 Hope that helps. 

 -
 Aaron Morton
 New Zealand
 @aaronmorton

 Co-Founder & Principal Consultant
 Apache Cassandra Consulting
 http://www.thelastpickle.com

 On 9/01/2014, at 7:24 am, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Jan 8, 2014 at 1:17 AM, Jiri Horky ho...@avast.com wrote:

 I am specifically interested in whether it is possible to upgrade just one
 node and keep it running like that for some time, i.e. whether the gossip
 protocol is compatible in both directions. We are a bit afraid to
 upgrade all nodes to 1.2.13 at once in case we would need to
 roll back.


 This is not officially supported. It will probably work for these
 particular versions, but it is not recommended.

 The most serious potential issue is an inability to replace the new
 node if it fails. There's also the problem of not being able to
 repair until you're back on the same versions. And other, similar,
 undocumented edge cases...

 =Rob





upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL

2014-01-08 Thread Jiri Horky
Hi all,

I would appreciate advice on whether it is a good idea to upgrade from
Cassandra 1.2.3 to 1.2.13 and how to best proceed. The particular
cluster consists of 3 nodes (each one in a different DC, holding 1
replica) with relatively low traffic and a 10GB load per node.

I am specifically interested in whether it is possible to upgrade just one
node and keep it running like that for some time, i.e. whether the gossip
protocol is compatible in both directions. We are a bit afraid to
upgrade all nodes to 1.2.13 at once in case we would need to roll back.

I know that the sstable format changed in 1.2.5, so in case we need to
roll back, the newly written data would need to be synchronized from the
old servers.

Also, after the migration to 1.2.13, we would like to start using
node-to-node encryption. I imagine that you need to configure it on all
nodes at once, so it would require a small outage.
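
For reference, node-to-node encryption is enabled in cassandra.yaml roughly as in
the sketch below (keystore paths and passwords are placeholders); since the setting
only takes effect on restart, nodes that have it enabled cannot talk to nodes that
do not, which is why a rolling change implies the small outage mentioned above:

server_encryption_options:
    internode_encryption: all
    keystore: conf/.keystore
    keystore_password: <keystore-password>
    truststore: conf/.truststore
    truststore_password: <truststore-password>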

Thank you in advance
Jiri Horky



Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-12 Thread Jiri Horky
Hi,
On 11/12/2013 05:29 AM, Aaron Morton wrote:
 Are you doing large slices or could you have a lot of tombstones
 on the rows?
 don't really know - how can I monitor that?
 For tombstones, do you do a lot of deletes ? 
 Also in v2.0.2 cfstats has this 

 Average live cells per slice (last five minutes): 0.0
 Average tombstones per slice (last five minutes): 0.0

 For large slices you need to check your code, e.g. do you have anything
 that reads lots of columns or very large columns, or lets the user
 select how many columns to read?

 The org.apache.cassandra.db.ArrayBackedSortedColumns in the trace back
 is used during reads (.e.g.
 org.apache.cassandra.db.filter.SliceQueryFilter)
thanks for the explanation; I will try to provide some figures (but
unfortunately not from 2.0.2).

 You probably want the heap to be 4G to 8G in size, 10G will
 encounter longer pauses. 
 Also the size of the new heap may be too big depending on the number
 of cores. I would recommend trying 800M
 I tried to decrease it first to 384M, then to 128M, with no change in
 the behaviour. I don't really mind the extra memory overhead of the cache
 - being able to actually point to it with objects - but I don't really
 see the reason why it should create/delete so many objects so
 quickly.
 Not sure what you changed to 384M.
Sorry for the confusion. I meant to say that I tried to decrease the row
cache size to 384M and then to 128M and the GC times did not change at
all (still ~30% of the time).
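
For reference, the sizes above refer to the global row cache budget in cassandra.yaml;
an illustrative excerpt with the last value tried:

row_cache_size_in_mb: 128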

 Shows the heap growing very quickly. This could be due to wide reads
 or a high write throughput. 
 Well, both prg01 and prg02 receive the same load, which is about
 ~150-250 (during peak) read requests per second and 100-160 write
 requests per second. The only node with the heap growing rapidly and GC
 kicking in is the one with the row cache enabled.

 This sounds like on a row cache miss cassandra is reading the whole
 row, which happens to be a wide row. I would also guess some writes
 are going to the rows and they are getting invalidated out of the row
 cache. 

 The row cache is not great for rows that update frequently and/or wide
 rows. 

 How big are the rows ? use nodetool cfstats and nodetool cfhistograms.
I will get in touch with the developers and take the data from cf*
commands in a few days (I am out of office for some days).

Thanks for the pointers, will get in touch.

Cheers
Jiri Horky



Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-08 Thread Jiri Horky
Hi,

On 11/07/2013 05:18 AM, Aaron Morton wrote:
 Class Name  
| Shallow Heap | Retained Heap
 ---
  
   |  |  
 java.nio.HeapByteBuffer @ 0x7806a0848
   |   48 |80
 '- name org.apache.cassandra.db.Column @ 0x7806424e8
|   32 |   112
|- [338530] java.lang.Object[540217] @ 0x57d62f560 Unreachable
   |2,160,888 | 2,160,888
|- [338530] java.lang.Object[810325] @ 0x591546540
   |3,241,320 | 7,820,328
|  '- elementData java.util.ArrayList @ 0x75e8424c0  
|   24 | 7,820,352
| |- list
 org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @
 0x5940e0b18  |   48 |   128
| |  '- val$filteredIter
 org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
   |   32 | 7,820,568
| | '- val$iter
 org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
 Unreachable   |   24 | 7,820,592
| |- this$0, parent java.util.ArrayList$SubList @ 0x5940e0bb8
|   40 |40
| |  '- this$1 java.util.ArrayList$SubList$1 @ 0x5940e0be0
   |   40 |80
| | '- currentSlice
 org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @
 0x5940e0b18|   48 |   128
| |'- val$filteredIter
 org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
 |   32 | 7,820,568
| |   '- val$iter
 org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
 Unreachable |   24 | 7,820,592
| |- columns org.apache.cassandra.db.ArrayBackedSortedColumns
 @ 0x5b0a33488  |   32 |56
| |  '- val$cf
 org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
 |   32 | 7,820,568
| | '- val$iter
 org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
 Unreachable   |   24 | 7,820,592
| '- Total: 3 entries
|  |  
|- [338530] java.lang.Object[360145] @ 0x7736ce2f0 Unreachable
   |1,440,600 | 1,440,600
'- Total: 3 entries  
|  |  

 Are you doing large slices or could you have a lot of tombstones on
 the rows?
don't really know - how can I monitor that?

 We have disabled the row cache on one node to see the difference. Please
 see the attached plots from VisualVM; I think that the effect is quite
 visible.
 The default row cache is stored off the JVM heap; have you changed to
 the ConcurrentLinkedHashCacheProvider?
Answered by Chris already :) No.

 One way the SerializingCacheProvider could impact GC is if the CF
 takes a lot of writes. The SerializingCacheProvider invalidates the
 row when it is written to and has to read the entire row and serialise
 it on a cache miss.

 -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G
 -Xmn1024M -XX:+HeapDumpOnOutOfMemoryError
 You probably want the heap to be 4G to 8G in size, 10G will encounter
 longer pauses. 
 Also the size of the new heap may be too big depending on the number
 of cores. I would recommend trying 800M
I tried to decrease it first to 384M, then to 128M, with no change in the
behaviour. I don't really mind the extra memory overhead of the cache -
being able to actually point to it with objects - but I don't really see the
reason why it should create/delete so many objects so quickly.


 prg01.visual.vm.png
 Shows the heap growing very quickly. This could be due to wide reads
 or a high write throughput.
Well, both prg01 and prg02 receive the same load, which is about ~150-250
(during peak) read requests per second and 100-160 write requests per
second. The only node with the heap growing rapidly and GC kicking in is the
one with the row cache enabled.


 Hope that helps.
Thank you!

Jiri Horky



Cassandra 1.2.9 -> 2.0.0 update can't

2013-11-05 Thread Jiri Horky
Hi,

I know it is Not Supported and also Not Advised, but currently we have
half of our cassandra cluster on 1.2.9 and the other half on 2.0.0, as we
got stuck during the upgrade. We already know that it was not a very wise
decision, but that's what we've got now.

I have basically two questions. First, it would really help us if there
was a way to make requests to the newer Cassandra nodes work with data on
the older ones. We get:
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not
enough replica available for query at consistency ONE (1 required but
only 0 alive),

which is surely caused by the fact that the newer nodes see the older ones
as DOWN in nodetool ring. We work around this by connecting to the
older nodes, which see the whole cluster as online, but it would help us in
deploying a newer version of our application if we could connect to 2.0.0 as
well.

Second, to avoid any future problems when upgrading, is there any
documentation on what is incompatible between the 1.2.x line and 2.0.x, what
is not supposed to work together, and what the recommended steps are? I
don't see any warning in the official documentation ("Changes impacting
upgrade") regarding an incompatibility of the gossip (?) protocol between
2.0.0 and 1.2.9 that would cause the described problem.

Thank you
Jiri Horky


Re: Recompacting all sstables

2013-11-02 Thread Jiri Horky
Hi,

On 11/01/2013 09:15 PM, Robert Coli wrote:
 On Fri, Nov 1, 2013 at 12:47 PM, Jiri Horky ho...@avast.com wrote:

 since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS,
 we hit the CASSANDRA-6284 bug.


 1) Why upgrade a cluster to 2.0.0? Hopefully not a production cluster? [1]
I think you already guessed the answer :) It is a production cluster; we
needed some features (particularly compare-and-set) only present in 2.0
because of the applications. Besides, somebody had to discover the
regression, right? :) Thanks for the link.

 3) What do you mean by upgraded half of our Cassandra cluster? That
 is Not Supported and also Not Advised... for example, before the
 streaming change in 2.x line, a cluster in such a state may be unable
 to have nodes added, removed or replaced.
We are in the middle of the migration from 1.2.9 to 2.0, during which we are
also upgrading our application, which can only be run against 2.0 due to
various technical details. It is rather hard to explain, but we hoped it
would last just a few days, and it is definitely not the state we
wanted to keep. Since we hit the bug, we got stalled in the middle of
the migration.

 So the question. What is the best way to recompact all the sstables so
 the data in one sstable within a level would contain more or less the
 right portion of the data

 ... 

 Based on the documentation, I can only think of switching to SizeTiered
 compaction, doing a major compaction and then switching back to LCS.


 That will work, though be aware of  the implication of CASSANDRA-6092
 [2]. Briefly, if the CF in question is not receiving write load, you
 will be unable to promote your One Big SSTable from L0 to L1. In that
 case, you might want to consider running sstable_split (and then
 restarting the node) in order to split your One Big SSTable into two
 or more smaller ones.
Hmm, thinking about it a bit more, I am unsure this will actually help.
If I understand things correctly, assuming a uniform distribution of newly
received keys in L0 (ensured by RandomPartitioner), in order for LCS to
work optimally, I need:

a) a uniform distribution of keys across the sstables in one level, i.e.
in every level each sstable will cover more or less the same range of keys
b) the sstables in each level should cover almost the whole space of keys
the node is responsible for
c) sstables should be propagated to higher levels in a uniform fashion, e.g.
round-robin or random (over time, the probability of choosing an
sstable as a candidate should be the same for all sstables in the level)

By splitting the sorted Big SSTable, I will get a bunch of
non-overlapping sstables. So I will surely achieve a). Point c) is fixed
by the patch. But what about b)? It probably depends on the order of
compaction across levels, i.e. whether the compactions in various levels
are run in parallel and interleaved or not. In case it compacts
all the tables from one level and only after that starts to compact
sstables in the higher level, etc., one will end up in a very similar
situation to the one caused by the referenced bug (because of the
round-robin fashion of choosing candidates), i.e. having the biggest keys
in L1 and the smallest keys in the highest level. So in this case, it would
actually not help at all.

Does it make sense or am I completely wrong? :)

BTW: Not a very thought-out idea, but wouldn't it actually be better to
select candidates completely randomly?

Cheers
Jiri Horky


 =Rob

 [1] https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
 [2] https://issues.apache.org/jira/browse/CASSANDRA-6092



Recompacting all sstables

2013-11-01 Thread Jiri Horky
Hi all

since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS,
we hit the CASSANDRA-6284 bug. So basically all data in sstables created
after the upgrade is wrongly (non-uniformly within compaction levels)
distributed. This causes a huge overhead when compacting new sstables
(see the bug for the details).

After applying the patch, the distribution of the data within a level is
supposed to recover itself over time, but we would like not to wait a
month or so until it gets better.

So the question: what is the best way to recompact all the sstables so that
the data in one sstable within a level contains more or less the
right portion of the data, in other words, so that keys are uniformly
distributed across the sstables within a level? (e.g. assuming a total token
range for a node of 1..10000, and given that L2 should contain 100
sstables, all sstables within L2 should cover a range of ~100 tokens each).

Based on the documentation, I can only think of switching to SizeTiered
compaction, doing a major compaction and then switching back to LCS.
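
A hedged sketch of that sequence (the keyspace/table names and the final sstable
size are placeholders; the major compaction in the middle is run with nodetool on
each node):

ALTER TABLE myks.mycf WITH compaction =
  {'class': 'SizeTieredCompactionStrategy'};
-- run "nodetool compact myks mycf" on every node, then switch back:
ALTER TABLE myks.mycf WITH compaction =
  {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};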

Thanks in advance
Jiri Horky