Re: Cassandra CLI showing inconsistent results during gets

2014-06-28 Thread Ravikumar Govindarajan
All inserts are at LOCAL_QUORUM in DC1.

I am confused because attempt 1 returns the column, attempt 2 reports it as
not found, and attempt 3 returns it again.

These attempts were made back to back, with no delay, from the same CLI
session. I am also sure the data was not touched by any create/update/delete
operations from elsewhere during this time.

--
Ravi

On Friday, June 27, 2014, Chris Lohfink clohf...@blackbirdit.com wrote:

 Where was the 09_09 column inserted from? Are you sure whatever did the
 insert is doing a LOCAL_QUORUM write on the same DC the cli is in? The write
 may be acknowledged before all the nodes have received it (i.e. after only 2
 of the 3 in the local DC), so a replica that has not yet received it will
 report not having the data. Once all the nodes respond to a read, the
 coordinator checks the digests from all the responses, sees there is an
 inconsistency and does a read repair, which would explain the column showing
 up on the following queries.
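
 For illustration, a minimal sketch of pinning both the write and the read to
 LOCAL_QUORUM from application code. This assumes the DataStax Python driver
 and a hypothetical CQL3 table; it is not the Thrift CLI used above, and all
 names are illustrative:

 from cassandra import ConsistencyLevel
 from cassandra.cluster import Cluster
 from cassandra.query import SimpleStatement

 cluster = Cluster(['127.0.0.1'])
 session = cluster.connect('sample')

 # Write at LOCAL_QUORUM so a local-DC quorum acknowledges the insert.
 write = SimpleStatement(
     "INSERT INTO testcf (key, part1, part2, value) VALUES (%s, %s, %s, %s)",
     consistency_level=ConsistencyLevel.LOCAL_QUORUM)
 session.execute(write, ('17732218001', '177322104550009_', 177322104560008, 'v'))

 # Read back at LOCAL_QUORUM; together with the LOCAL_QUORUM write this
 # guarantees overlap with at least one up-to-date replica in the local DC.
 read = SimpleStatement(
     "SELECT value FROM testcf WHERE key = %s AND part1 = %s AND part2 = %s",
     consistency_level=ConsistencyLevel.LOCAL_QUORUM)
 print(list(session.execute(read, ('17732218001', '177322104550009_', 177322104560008))))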

 Chris

 On Jun 26, 2014, at 10:06 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 I ran the following set of commands via the CLI on our servers. There is a
 data discrepancy that I encountered during gets, as shown below...

 We are running version 1.2.4 with replication-factor=3 (DC1) & 2 (DC2).
 Reads and writes are at LOCAL_QUORUM.

 create column family TestCF with key_validation_class=AsciiType AND
 comparator = 'CompositeType(AsciiType,LongType)' AND
 compression_options={sstable_compression:SnappyCompressor,
 chunk_length_kb:64};

 [default@Sample] consistencylevel AS LOCAL_QUORUM;
 Consistency level is set to 'LOCAL_QUORUM'.

 [default@Sample] get TestCF [ascii('17732218001')] ['
 *177322104550009_:177322104560008*'];
 => (column=177322104550009_:177322104560008,
 value=31373733323231303030303034353530303039, timestamp=1397743374931)
 Elapsed time: 8.64 msec(s).

 //Do a full row dump which shows the above column
 [default@Sample] get TestCF [ascii('17732218001')];
 ...
 => (column=177322104547019_:177322104560001,
 value=31373733323231303030303034353437303139, timestamp=1397743139121)
 => (column=*177322104550009_:177322104560008*,
 value=31373733323231303030303034353530303039, timestamp=1397743374931)
 => (column=177322104560003_:177322104560005,
 value=31373733323231303030303034353630303033, timestamp=1397743323261)
 => (column=177322104562001_:177322104564003,
 value=31373733323231303030303034353632303031, timestamp=1397749523707)
 ---
 Returned 4771 results.
 Elapsed time: 518 msec(s).
 //Try again
 [default@Sample] get TestCF[ascii('17732218001')] ['
 *177322104550009_:177322104560008*'];
 => (column=177322104550009_:177322104560008,
 value=31373733323231303030303034353530303039, timestamp=1397743374931)
 Elapsed time: 8.03 msec(s).

 //Here CLI flipped showing value as not found
 [default@Sample] get TestCF[ascii('17732218001')] ['
 *177322104550009_:177322104550009*'];
 *Value was not found*
 Elapsed time: 12 msec(s).

 //Query again, it shows as value found
 [default@Sample]
 get TestCF[ascii('17732218001')] 
 ['177322104550009_:177322104550009'];
 => (column=177322104550009_:177322104550009,
 value=31373733323231303030303034353530303039, timestamp=1397743374931)
 Elapsed time: 23 msec(s).

 Is this just a CLI bug, or is something deeper brewing? Our app hit a
 serious issue in code involving this query. Is it a known issue?

 Any help is much appreciated

 --
 Ravi





Cassandra CLI showing inconsistent results during gets

2014-06-26 Thread Ravikumar Govindarajan
I ran the following set of commands via the CLI on our servers. There is a
data discrepancy that I encountered during gets, as shown below...

We are running version 1.2.4 with replication-factor=3 (DC1) & 2 (DC2).
Reads and writes are at LOCAL_QUORUM.

create column family TestCF with key_validation_class=AsciiType AND
comparator = 'CompositeType(AsciiType,LongType)' AND
compression_options={sstable_compression:SnappyCompressor,
chunk_length_kb:64};

[default@Sample] consistencylevel AS LOCAL_QUORUM;
Consistency level is set to 'LOCAL_QUORUM'.

[default@Sample] get TestCF [ascii('17732218001')] ['
*177322104550009_:177322104560008*'];
=> (column=177322104550009_:177322104560008,
value=31373733323231303030303034353530303039, timestamp=1397743374931)
Elapsed time: 8.64 msec(s).

//Do a full row dump which shows the above column
[default@Sample] get TestCF [ascii('17732218001')];
...
=> (column=177322104547019_:177322104560001,
value=31373733323231303030303034353437303139, timestamp=1397743139121)
=> (column=*177322104550009_:177322104560008*,
value=31373733323231303030303034353530303039, timestamp=1397743374931)
=> (column=177322104560003_:177322104560005,
value=31373733323231303030303034353630303033, timestamp=1397743323261)
=> (column=177322104562001_:177322104564003,
value=31373733323231303030303034353632303031, timestamp=1397749523707)
---
Returned 4771 results.
Elapsed time: 518 msec(s).
//Try again
[default@Sample] get TestCF[ascii('17732218001')] ['
*177322104550009_:177322104560008*'];
=> (column=177322104550009_:177322104560008,
value=31373733323231303030303034353530303039, timestamp=1397743374931)
Elapsed time: 8.03 msec(s).

//Here CLI flipped showing value as not found
[default@Sample] get TestCF[ascii('17732218001')] ['
*177322104550009_:177322104550009*'];
*Value was not found*
Elapsed time: 12 msec(s).

//Query again, it shows as value found
[default@Sample]
get TestCF[ascii('17732218001')]
['177322104550009_:177322104550009'];
=> (column=177322104550009_:177322104550009,
value=31373733323231303030303034353530303039, timestamp=1397743374931)
Elapsed time: 23 msec(s).

Is this just a CLI bug, or is something deeper brewing? Our app hit a
serious issue in code involving this query. Is it a known issue?

Any help is much appreciated

--
Ravi


Multi-range of composite query possible?

2013-11-27 Thread Ravikumar Govindarajan
We have the following structure in a composite CF, comprising 2 parts

Key=123 -> A:1, A:2, A:3, B:1, B:2, B:3, B:4, C:1, C:2, C:3, ...

Our application provides the following inputs for querying on the
first-part of composite column

key=123, [(colName=A, range=2), (colName=B, range=3), (colName=C, range=1)]

The below output is desired

key=123 -> A:1, A:2 [Get first 2 composite cols for prefix 'A']

   B:1, B:2, B:3 [Get first 3 composite cols for prefix 'B']

   C:1 [Get the first composite col for prefix 'C']

I see that this is akin to a range-of-ranges query via composite columns. Is
something like this possible in Cassandra, maybe in the latest versions?
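
For what it's worth, one work-around is to issue one bounded query per
first-part prefix and stitch the results together client-side. Below is a
hedged sketch using the DataStax Python driver against an assumed CQL3 table
with (colname, seq) as clustering columns; all names are illustrative:

# Sketch: emulate a "range of ranges" by running one limited slice per prefix.
# Assumes a hypothetical table:
#   CREATE TABLE cf (key int, colname text, seq int, PRIMARY KEY (key, colname, seq));
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('ks')
requests = [('A', 2), ('B', 3), ('C', 1)]   # (prefix, per-prefix limit)

results = {}
for prefix, limit in requests:
    rows = session.execute(
        "SELECT colname, seq FROM cf WHERE key = %s AND colname = %s LIMIT %s",
        (123, prefix, limit))
    results[prefix] = [(r.colname, r.seq) for r in rows]
print(results)   # e.g. {'A': [('A', 1), ('A', 2)], 'B': [...], 'C': [('C', 1)]}

Each per-prefix query is a cheap, bounded slice, so the cost stays proportional
to the rows actually requested rather than to the whole wide row.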

--
Ravi


Re: Deleting data using timestamp

2013-10-10 Thread Ravikumar Govindarajan
Thanks for the links. I wanted to avoid a major compaction somehow.

I see many JIRA issues on timestamps related to compaction/reads. So many
improvements have been proposed.

--
Ravi


On Thu, Oct 10, 2013 at 12:26 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

 Ahh, yes, 'compaction'. I blanked out while mentioning repair and cleanup.
 That is in fact what needs to be done first and what I meant. Thanks
 Robert.

 Regards,
 Shahab


 On Wed, Oct 9, 2013 at 1:50 PM, Robert Coli rc...@eventbrite.com wrote:

 On Wed, Oct 9, 2013 at 7:35 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 What is the quick way to delete old-data and at the same time make sure
 read [doesn't] churn through all deleted columns?


 Use a database that isn't log structured?

 But seriously, in 2.0 there's this :

 https://issues.apache.org/jira/browse/CASSANDRA-5514

 Which allows for timestamp hints at query time.

 And...

 https://issues.apache.org/jira/browse/CASSANDRA-5228

 Which does compaction expiration of entire SSTables based on TTL.
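
 For instance (a hedged sketch, assuming Cassandra 2.0+ and a hypothetical
 CQL3 table): writing the columns with a TTL up front lets that mechanism
 drop whole expired sstables later, without a major compaction:

 # Sketch: insert with a TTL so fully-expired sstables can eventually be
 # dropped by compaction (CASSANDRA-5228). Names are illustrative only.
 from cassandra.cluster import Cluster

 session = Cluster(['127.0.0.1']).connect('ks')
 session.execute(
     "INSERT INTO wide_cf (key, col, value) VALUES (%s, %s, %s) USING TTL %s",
     ('user-1', 'c1', 'v1', 7 * 24 * 3600))   # expire after 7 days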

 =Rob





RangeSliceCommand serialize issue

2013-10-10 Thread Ravikumar Govindarajan
We have suddenly started receiving RangeSliceCommand serializer errors.

We are running version 1.2.4.

This does not happen for names-based commands; we get this error only for
slice-based commands.

Any help is greatly appreciated

ERROR [Thread-405] 2013-10-10 07:58:13,453 CassandraDaemon.java (line 174)
Exception in thread Thread[Thread-405,5,main]
java.lang.NegativeArraySizeException
at
org.apache.cassandra.dht.Token$TokenSerializer.deserialize(Token.java:97)
at
org.apache.cassandra.dht.AbstractBounds$AbstractBoundsSerializer.deserialize(AbstractBounds.java:172)
at
org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:297)
at
org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:179)
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:94)
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:203)
at
org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:135)
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:82)


Deleting data using timestamp

2013-10-09 Thread Ravikumar Govindarajan
We have wide rows accumulated in a Cassandra CF and have now changed our
app-side logic.

The application now only wants the first 7 days of data from this CF.

What is the quick way to delete old data and at the same time make sure
reads don't churn through all the deleted columns?

Let's say I do the following:

for (each key in CF)
  drop key with timestamp = (System.currentTimeMillis() - 7 days)

What should I do in my read to make sure that deleted columns don't get
examined?

I saw some advice on using max-timestamp per SSTable during read. Can
someone explain if that will solve my read problem here?
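
For concreteness, the delete loop above might look like the sketch below with
the DataStax Python driver: a partition-level delete issued with an explicit
write timestamp of (now - 7 days), so only columns older than that are
shadowed. This is a hedged sketch against a hypothetical table, and SELECT
DISTINCT on partition keys needs Cassandra 2.0+:

# Sketch: delete everything older than 7 days by writing a row tombstone with
# an explicit timestamp; newer columns keep higher timestamps and survive.
# Hypothetical table "wide_cf" with partition key "key".
import time
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('ks')
cutoff_micros = int((time.time() - 7 * 24 * 3600) * 1000000)  # CQL timestamps are microseconds

for row in session.execute("SELECT DISTINCT key FROM wide_cf"):
    session.execute(
        "DELETE FROM wide_cf USING TIMESTAMP %s WHERE key = %s",
        (cutoff_micros, row.key))

Note that reads will still have to skip the resulting tombstones until they are
compacted away, which is exactly the concern raised above.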

--
Ravi


Composite Column Grouping

2013-09-10 Thread Ravikumar Govindarajan
I am facing a problem of grouping composite columns on their second part.

Let's say my CF contains this:


TimeSeriesCF
   key:UserID
   composite-col-name:TimeUUID:PKID

Some sample data

UserID = XYZ
 Time:PKID
   Col-Name1 = 200:1000
   Col-Name2 = 201:1001
   Col-Name3 = 202:1000
   Col-Name4 = 203:1000
   Col-Name5 = 204:1002

Whenever a time-series query is issued, it should return the following in
time-desc order.

UserID = XYZ
  Col-Name5 = 204:1002
  Col-Name4 = 203:1000
  Col-Name2 = 201:1001

Is something like this possible in Cassandra? Is there a different way to
design and achieve the same objective?

--
Ravi


Re: Composite Column Grouping

2013-09-10 Thread Ravikumar Govindarajan
Thanks Michael,

But I cannot sort the rows in memory, as the number of columns will be
quite huge.

From the python script above:
   select_stmt = "select * from time_series where userid = 'XYZ'"

This would return many hundreds of thousands of columns. I need to go in
time-series order using ranges [pagination queries].


On Wed, Sep 11, 2013 at 7:06 AM, Laing, Michael
michael.la...@nytimes.com wrote:

 If you have set up the table as described in my previous message, you
 could run this python snippet to return the desired result:

 #!/usr/bin/env python
 # -*- coding: utf-8 -*-
 import logging
 logging.basicConfig()

 from operator import itemgetter

 import cassandra
 from cassandra.cluster import Cluster
 from cassandra.query import SimpleStatement

 cql_cluster = Cluster()
 cql_session = cql_cluster.connect()
 cql_session.set_keyspace('latest')

 select_stmt = "select * from time_series where userid = 'XYZ'"
 query = SimpleStatement(select_stmt)
 rows = cql_session.execute(query)

 results = []
 for row in rows:
     max_time = max(row.colname.keys())
     results.append((row.userid, row.pkid, max_time, row.colname[max_time]))

 sorted_results = sorted(results, key=itemgetter(2), reverse=True)
 for result in sorted_results: print result

 # prints:

 # (u'XYZ', u'1002', u'204', u'Col-Name-5')
 # (u'XYZ', u'1000', u'203', u'Col-Name-4')
 # (u'XYZ', u'1001', u'201', u'Col-Name-2')



 On Tue, Sep 10, 2013 at 6:32 PM, Laing, Michael michael.la...@nytimes.com
  wrote:

 You could try this. C* doesn't do it all for you, but it will efficiently
 get you the right data.

 -ml

 -- put this in a file and run using 'cqlsh -f file'

 DROP KEYSPACE latest;

 CREATE KEYSPACE latest WITH replication = {
 'class': 'SimpleStrategy',
 'replication_factor' : 1
 };

 USE latest;

 CREATE TABLE time_series (
 userid text,
 pkid text,
 colname map<text, text>,
 PRIMARY KEY (userid, pkid)
 );

 UPDATE time_series SET colname = colname + {'200':'Col-Name-1'} WHERE
 userid = 'XYZ' AND pkid = '1000';
 UPDATE time_series SET colname = colname +
 {'201':'Col-Name-2'} WHERE userid = 'XYZ' AND pkid = '1001';
 UPDATE time_series SET colname = colname +
 {'202':'Col-Name-3'} WHERE userid = 'XYZ' AND pkid = '1000';
 UPDATE time_series SET colname = colname +
 {'203':'Col-Name-4'} WHERE userid = 'XYZ' AND pkid = '1000';
 UPDATE time_series SET colname = colname +
 {'204':'Col-Name-5'} WHERE userid = 'XYZ' AND pkid = '1002';

 SELECT * FROM time_series WHERE userid = 'XYZ';

 -- returns:
 -- userid | pkid | colname
 ---------+------+-----------------------------------------------------------------
 --   XYZ | 1000 | {'200': 'Col-Name-1', '202': 'Col-Name-3', '203': 'Col-Name-4'}
 --   XYZ | 1001 | {'201': 'Col-Name-2'}
 --   XYZ | 1002 | {'204': 'Col-Name-5'}

 -- use an app to pop off the latest key/value from the map for each row,
 then sort by key desc.


 On Tue, Sep 10, 2013 at 9:21 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 I have been faced with a problem of grouping composites on the
 second-part.

 Lets say my CF contains this


 TimeSeriesCF
key:UserID
composite-col-name:TimeUUID:PKID

 Some sample data

 UserID = XYZ
  Time:PKID
Col-Name1 = 200:1000
Col-Name2 = 201:1001
Col-Name3 = 202:1000
Col-Name4 = 203:1000
Col-Name5 = 204:1002

 Whenever a time-series query is issued, it should return the following
 in time-desc order.

 UserID = XYZ
   Col-Name5 = 204:1002
   Col-Name4 = 203:1000
   Col-Name2 = 201:1001

 Is something like this possible in Cassandra? Is there a different way
 to design and achieve the same objective?

 --
 Ravi







Re: Key-Token mapping in cassandra

2013-04-19 Thread Ravikumar Govindarajan
I think I have simplified my example a little too much.

Lets assume that there are groups and users.

Ideally a grpId becomes the key and it holds some meta-data.

Lets say GroupMetaCF

grpId -> key, entityId -> col-name, blobdata -> col-value

Now we have a UserTimeSeriesCF

grpId/userId -> key, UUID -> col-name, entityId -> col-value

[Each user will view a subset of the grp data, based on roles etc...]

There are many more such CFs, all keyed with a grpId prefix. By hashing only
the grpId to Cassandra's token, I thought we could co-locate all of a group's
data on one set of replica nodes.

Is there a way to achieve this?
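
One direction I can sketch (hedged, hypothetical schema, not something from
this thread): in CQL3 only the partition key is hashed to the token, so making
grpId the partition key and pushing userId and the time component into
clustering columns keeps every row of a group on the same replica set:

# Sketch: co-locate a group's data by making grpId the partition key; the
# partition key alone determines the token, so all rows of a group land on
# the same replicas. Names are illustrative only.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('ks')
session.execute("""
    CREATE TABLE user_time_series (
        grpid    text,
        userid   text,
        ts       timeuuid,
        entityid text,
        PRIMARY KEY (grpid, userid, ts)
    )""")
# Every row with grpid = 'G123' hashes to the same token, regardless of userid.

The trade-off is that a whole group becomes a single, potentially very wide,
partition on that replica set.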

--
Ravi


On Thu, Apr 18, 2013 at 1:26 PM, aaron morton aa...@thelastpickle.com wrote:

 All rows with the same key go on the same nodes. So if you use the same
 row key in different CF's they will be on the same nodes. i.e. have CF's
 called Image, Documents, Meta and store rows in all of them with the 123
 key.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/04/2013, at 1:32 PM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 Thanks Aaron.
  We are looking at co-locating all keys for a given user in one Cassandra
 node.
 Are there any other ways to achieve this?

 --
 Ravi

 On Thursday, April 18, 2013, aaron morton wrote:

 CASSANDRA-1034

 That ticket is about removing an assumption which was not correct.

 I would like all keys with 123 as prefix to be mapped to a single token.

 Why?
 it's not possible nor desirable IMHO. Tokens are used to identify a
 single row internally.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 17/04/2013, at 11:25 PM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 We would like to map multiple keys to a single token in cassandra. I
 believe this should be possible now with CASSANDRA-1034

 Ex:

 Key1 -> 123/IMAGE
 Key2 -> 123/DOCUMENTS
 Key3 -> 123/MULTIMEDIA

 I would like all keys with 123 as prefix to be mapped to a single token.

 Is this possible? What should be the Partitioner that I should most
 likely extend and write my own to achieve the desired result?

 --
 Ravi






Key-Token mapping in cassandra

2013-04-17 Thread Ravikumar Govindarajan
We would like to map multiple keys to a single token in cassandra. I
believe this should be possible now with CASSANDRA-1034

Ex:

Key1 -> 123/IMAGE
Key2 -> 123/DOCUMENTS
Key3 -> 123/MULTIMEDIA

I would like all keys with 123 as prefix to be mapped to a single token.

Is this possible? Which Partitioner should I most likely extend, to write my
own and achieve the desired result?

--
Ravi


Digest Query Seems to be corrupt on certain cases

2013-03-27 Thread Ravikumar Govindarajan
We started receiving OOMs in our Cassandra grid and took a heap dump. We
are running version 1.0.7 with LOCAL_QUORUM for both reads and writes.

After some analysis, we have more or less identified the problem: a
SliceByNamesReadCommand involving a single Super-Column. This seems to be
happening only for digest queries and not during actual data reads.

I am pasting the serialized byte array of SliceByNamesReadCommand, which
seems to be corrupt on issuing certain digest queries.

//Type is SliceByNamesReadCommand
body[0] = (byte)1;
 //This is a digest query here.
body[1] = (byte)1;

//Table-Name from 2-8 bytes

//Key-Name from 9-18 bytes

//QueryPath deserialization here

 //CF-Name from 19-30 bytes

//Super-Col-Name from 31st byte onwards, but gets
corrupt as found in heap dump

//body[32-37] = 0, body[38] = 1, body[39] = 0.  This
causes the SliceByNamesDeserializer to mark both ColName=NULL and
SuperColName=NULL, fetching entire wide-row!!!

   //Actual super-col-name starts only from byte 40,
whereas it should have started from 31st byte itself

Has someone already encountered such an issue? Why is the super-col-name
not correctly deserialized during the digest query?

--
Ravi


Re: Digest Query Seems to be corrupt on certain cases

2013-03-27 Thread Ravikumar Govindarajan
VM Settings are
-javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms8G -Xmx8G -Xmn800M
-XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

The error stack contained 2 threads for the same key, stalling on the digest
query.

The below bytes which I referred is the actual value of _body variable in
org.apache.cassandra.net.Message object got from the heap dump.

As I understand from the code, ReadVerbHandler will deserialize this
_body variable into a SliceByNamesReadCommand object.

When I manually inspected this byte array, it seems to hold all details
correctly except the super-column name, causing it to fetch the entire
wide row.

--
Ravi

On Thu, Mar 28, 2013 at 8:36 AM, aaron morton aa...@thelastpickle.com wrote:

 We started receiving OOMs in our cassandra grid and took a heap dump

 What are the JVM settings ?
 What was the error stack?

 I am pasting the serialized byte array of SliceByNamesReadCommand, which
 seems to be corrupt on issuing certain digest queries.

 Sorry I don't follow what you are saying here.
 Can you can you enable DEBUG logging and identify the behaviour you think
 is incorrect ?

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 28/03/2013, at 4:15 AM, Ravikumar Govindarajan 
 ravikumar.govindara...@gmail.com wrote:

 We started receiving OOMs in our cassandra grid and took a heap dump. We
 are running version 1.0.7 with LOCAL_QUORUM from both reads/writes.

 After some analysis, we kind of identified the problem, with
 SliceByNamesReadCommand, involving a single Super-Column. This seems to be
 happening only in digest query and not during actual reads.

 I am pasting the serialized byte array of SliceByNamesReadCommand, which
 seems to be corrupt on issuing certain digest queries.

 //Type is SliceByNamesReadCommand
  body[0] = (byte)1;
  //This is a digest query here.
  body[1] = (byte)1;

 //Table-Name from 2-8 bytes

 //Key-Name from 9-18 bytes

 //QueryPath deserialization here

  //CF-Name from 19-30 bytes

 //Super-Col-Name from 31st byte onwards, but gets
 corrupt as found in heap dump

 //body[32-37] = 0, body[38] = 1, body[39] = 0.  This
 causes the SliceByNamesDeserializer to mark both ColName=NULL and
 SuperColName=NULL, fetching entire wide-row!!!

//Actual super-col-name starts only from byte 40,
 whereas it should have started from 31st byte itself

 Has someone already encountered such an issue? Why is the super-col-name
 not correctly de-serialized during digest query.

 --
 Ravi





Re: Offsets and Range Queries

2012-11-15 Thread Ravikumar Govindarajan
Thanks Ed, for the clarifications

Yes, you are correct that with absolute offsets it is the apps that have to
handle repeatable reads and not the databases themselves, but SQL databases
do provide such an option, at the app's peril!!!

Slices have a fixed size; this ensures that the query does not
execute for arbitrary lengths of time.

I assume that is because the read path uses iterators that merge/reduce/collate
results one by one, which is not well suited to jumping to arbitrary offsets
given the practically huge number of columns involved, right? Did I understand
it correctly?

We are now faced with persisting both the first and last key of each page for
prev/next navigation. The problem quickly gets complex when we have to
support multiple pages per user. I just wanted to know if there are any known
work-arounds for this.
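
For reference, the usual work-around is key-set style paging: remember the last
column of the current page and use it as the exclusive start of the next slice.
A hedged sketch with the DataStax Python driver against a hypothetical CQL3
table with PRIMARY KEY (key, col); all names are illustrative:

# Keyset ("last seen") paging: each page starts strictly after the last
# clustering value of the previous page, so no absolute offset is needed.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('ks')
PAGE_SIZE = 25

def fetch_page(key, start_after=None):
    if start_after is None:
        rows = session.execute(
            "SELECT col, value FROM cf WHERE key = %s LIMIT %s", (key, PAGE_SIZE))
    else:
        rows = session.execute(
            "SELECT col, value FROM cf WHERE key = %s AND col > %s LIMIT %s",
            (key, start_after, PAGE_SIZE))
    rows = list(rows)
    next_start = rows[-1].col if rows else None  # persist this for the "next" link
    return rows, next_start

page1, cursor = fetch_page('user-1')
page2, cursor = fetch_page('user-1', cursor)

For "prev" navigation the first column of each visited page has to be remembered
as well, which is essentially the first/last-key bookkeeping described above.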

--
Ravi

On Thu, Nov 15, 2012 at 9:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 There are several reasons. First there is no absolute offset. The
 rows are sorted by the data. If someone inserts new data between your
 query and this query the rows have changed.

 Unless you're doing select queries inside a transaction with repeatable
 read, and your database supports this, the query you mention does not
 really have absolute offsets either. The results of the query can
 change between reads.

 In cassandra we do not execute large queries (that might result in
 temp tables or whatever) and allow you to page them. Slices have a
 fixed size; this ensures that the query does not execute for
 arbitrary lengths of time.


 On Thu, Nov 15, 2012 at 6:39 AM, Ravikumar Govindarajan
 ravikumar.govindara...@gmail.com wrote:
  Usually we do a SELECT * FROM ... ORDER BY ... LIMIT 26,25 for pagination
  purposes, but specifying an offset is not available for range queries in
  cassandra.
 
  I always have to specify a start-key to achieve this. Are there reasons
 for
  choosing such an approach rather than providing an absolute offset?
 
  --
  Ravi



Re: Composite Column Types Storage

2012-09-20 Thread Ravikumar Govindarajan
As I understand from the link below, burning the column index-info onto the
sstable index files will not only eliminate sstables from the read path but
also reduce disk seeks from 3 to 2 for wide rows.

Our index files are always mmapped, so there is only one random seek for a
named column query. I think that is a wonderful improvement.

Shouldn't we be wary of the spike in heap usage from promoting column indexes
to the index file?

It would be nice to have, say, every 128th entry written out to disk, while
loading only every 512th index entry into memory during start-up, just as a
balancing factor.

--
Ravi

On Tue, Sep 18, 2012 at 4:47 PM, Sylvain Lebresne sylv...@datastax.com wrote:

  Range queries do not use bloom filters. It holds good for
 composite-columns
  also right?

 Since I assume you are referring to column's bloom filters (key's bloom
 filters
 are always used) then yes, that holds good for composite columns.
 Currently,
 composite column name are completely opaque to the storage engine.

  Column-part-1 alone could have gone into the bloom-filter, speeding up
 my
  queries really effectively

 True, though https://issues.apache.org/jira/browse/CASSANDRA-2319 (in 1.2
 only
 however) should help quite a lot here. Basically it will allow to skip the
 sstable based on the column index. Granted, this is less fined grained
 than a
 bloom filter (though on the other side there is no false positive), but I
 suspect that in most real life workload it won't be too much worse.

 --
 Sylvain



Re: Composite Column Types Storage

2012-09-12 Thread Ravikumar Govindarajan
Thanks for the clarification. Even though compression solves the disk space
issue, we might still have Memtable bloat, right?

There is another issue to be handled for us. The queries are always going
to be range queries with an absolute match on part 1 and a range on part 2 of
the composite columns.

Ex: Query some-key Column-part-1 Start-Id-part-2 Limit
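
(In CQL3 terms, against an assumed table with (part1, part2) as clustering
columns, that query shape is roughly the following sketch; table and column
names are illustrative only.)

# Sketch of the query shape above: exact match on part 1, range on part 2,
# bounded by a limit. Hypothetical table: PRIMARY KEY (key, part1, part2).
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('ks')
rows = session.execute(
    "SELECT part2, value FROM cf WHERE key = %s AND part1 = %s AND part2 >= %s LIMIT %s",
    ('some-key', 'part1-value', 1000, 100))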

Range queries do not use bloom filters. That holds good for composite columns
also, right? I believe I will end up writing BF bytes only to skip them later.

If sharing had been possible, then Column-part-1 alone could have gone
into the bloom-filter, speeding up my queries really effectively.

But as I understand it, there are many levels of nesting possible in a
composite type, and special-casing at every level is a big task.

Maybe special-casing the top level or the first part would be a good start?

--
Ravi

On Wed, Sep 12, 2012 at 5:46 PM, Sylvain Lebresne sylv...@datastax.com wrote:

  Is every string/id combination stored separately in disk

 Yes, each combination is stored separately on disk (the storage engine
 itself doesn't have special casing for composite column, at least not
 yet). But as far as disk space is concerned, I suspect that sstable
 compression makes this largely a non issue.

 --
 Sylvain



Re: GC freeze just after repair session

2012-07-06 Thread Ravikumar Govindarajan
Our young gen size = 800 MB, SurvivorRatio=8, eden size = 640 MB. All
objects/bytes generated during compaction are garbage, right?

During compaction, with in_memory_compaction_limit=64MB and
concurrent_compactors=8, there is a lot of pressure on ParNew sweeps.

I was thinking of decreasing concurrent_compactors and
in_memory_compaction_limit to go easy on GC.
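
Something in this direction in cassandra.yaml (illustrative values only, not
tested recommendations):

concurrent_compactors: 2
in_memory_compaction_limit_in_mb: 32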

I am not familiar with the inner workings of Cassandra, but hope I have
diagnosed the problem to some extent.

On Fri, Jul 6, 2012 at 11:27 AM, rohit bhatia rohit2...@gmail.com wrote:

 @ravi, you can increase the young gen size, keep a high tenuring rate, or
 increase the survivor ratio.


 On Fri, Jul 6, 2012 at 4:03 AM, aaron morton aa...@thelastpickle.com
 wrote:
  Ideally we would like to collect maximum garbage from ParNew itself,
 during
  compactions. What are the steps to take towards to achieving this?
 
  I'm not sure what you are asking.
 
  Cheers
 
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
 
  On 5/07/2012, at 6:56 PM, Ravikumar Govindarajan wrote:
 
  We have modified maxTenuringThreshold from 1 to 5. May be it is causing
  problems. Will change it back to 1 and see how the system is.
 
  concurrent_compactors=8. We will reduce this, as anyway our system won't
 be
  able to handle this number of compactions at the same time. Think it will
  ease GC also to some extent.
 
  Ideally we would like to collect maximum garbage from ParNew itself,
 during
  compactions. What are the steps to take towards to achieving this?
 
  On Wed, Jul 4, 2012 at 4:07 PM, aaron morton aa...@thelastpickle.com
  wrote:
 
  It *may* have been compaction from the repair, but it's not a big CF.
 
  I would look at the logs to see how much data was transferred to the
 node.
  Was their a compaction going on while the GC storm was happening ? Do
 you
  have a lot of secondary indexes ?
 
  If you think it correlated to compaction you can try reducing the
  concurrent_compactors
 
  Cheers
 
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
 
  On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:
 
  Recently, we faced a severe freeze [around 30-40 mins] on one of our
  servers. There were many mutations/reads dropped. The issue happened
 just
  after a routine nodetool repair for the below CF completed [1.0.7, NTS,
  DC1:3,DC2:2]
 
  Column Family: MsgIrtConv
  SSTable count: 12
  Space used (live): 17426379140
  Space used (total): 17426379140
  Number of Keys (estimate): 122624
  Memtable Columns Count: 31180
  Memtable Data Size: 81950175
  Memtable Switch Count: 31
  Read Count: 8074156
  Read Latency: 15.743 ms.
  Write Count: 2172404
  Write Latency: 0.037 ms.
  Pending Tasks: 0
  Bloom Filter False Postives: 1258
  Bloom Filter False Ratio: 0.03598
  Bloom Filter Space Used: 498672
  Key cache capacity: 20
  Key cache size: 20
  Key cache hit rate: 0.9965579513062582
  Row cache: disabled
  Compacted row minimum size: 51
  Compacted row maximum size: 89970660
  Compacted row mean size: 226626
 
 
  Our heap config is as follows
 
  -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
  -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 -XX:SurvivorRatio=8
  -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly
 
  from yaml
  in_memory_compaction_limit=64
  compaction_throughput_mb_sec=8
  multi_threaded_compaction=false
 
   INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085
 AntiEntropyService.java
  (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is
  fully synced
   INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085
  AntiEntropyService.java (line 698) [repair
  #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
   INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219
 CompactionTask.java
  (line 221) Compacted to
  [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,].  47,907,012 to
  40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s.
  Time:
  6,186ms.
 
  After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran
  for every 3 seconds, while CMS ran for every 30 seconds approx
 continuous
  for 40 minutes.
 
   INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line
  122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is
  8506048512
   INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line
  122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is
  8506048512
 
  .
 
   INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line
  122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is
  8506048512
   INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line
  122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is
  8506048512
   INFO [ScheduledTasks

Re: GC freeze just after repair session

2012-07-05 Thread Ravikumar Govindarajan
We have modified maxTenuringThreshold from 1 to 5. Maybe it is causing
problems. Will change it back to 1 and see how the system behaves.

concurrent_compactors=8. We will reduce this, as our system won't be able to
handle this many simultaneous compactions anyway. I think it will also ease GC
to some extent.

Ideally we would like to collect maximum garbage from ParNew itself, during
compactions. What are the steps to take towards achieving this?

On Wed, Jul 4, 2012 at 4:07 PM, aaron morton aa...@thelastpickle.com wrote:

 It *may* have been compaction from the repair, but it's not a big CF.

 I would look at the logs to see how much data was transferred to the node.
 Was their a compaction going on while the GC storm was happening ? Do you
 have a lot of secondary indexes ?

 If you think it correlated to compaction you can try reducing the
 concurrent_compactors

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

 Recently, we faced a severe freeze [around 30-40 mins] on one of our
 servers. There were many mutations/reads dropped. The issue happened just
 after a routine nodetool repair for the below CF completed [1.0.7, NTS,
 DC1:3,DC2:2]

 Column Family: MsgIrtConv
 SSTable count: 12
 Space used (live): 17426379140
  Space used (total): 17426379140
 Number of Keys (estimate): 122624
 Memtable Columns Count: 31180
  Memtable Data Size: 81950175
 Memtable Switch Count: 31
 Read Count: 8074156
  Read Latency: 15.743 ms.
 Write Count: 2172404
 Write Latency: 0.037 ms.
  Pending Tasks: 0
 Bloom Filter False Postives: 1258
 Bloom Filter False Ratio: 0.03598
  Bloom Filter Space Used: 498672
 Key cache capacity: 20
 Key cache size: 20
  Key cache hit rate: 0.9965579513062582
 Row cache: disabled
 Compacted row minimum size: 51
  Compacted row maximum size: 89970660
 Compacted row mean size: 226626


 Our heap config is as follows

 -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
 -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75
 -XX:+UseCMSInitiatingOccupancyOnly

 from yaml
 in_memory_compaction_limit=64
 compaction_throughput_mb_sec=8
 multi_threaded_compaction=false

  INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java
 (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is fully synced
  INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java
 (line 698) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
  INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java
 (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,].  47,907,012 to
 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s.  Time:
 6,186ms.

 After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran
 for every 3 seconds, while CMS ran for every 30 seconds approx continuous
 for 40 minutes.

  INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line
 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line
 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is
 8506048512

 .

  INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line
 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line
 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line
 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line
 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line
 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is
 8506048512
  INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line
 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is
 8506048512

 After this, the node became stable and was back up and running. Any pointers
 will be greatly appreciated.





Re: MurmurHash NPE during compaction

2012-06-22 Thread Ravikumar Govindarajan
Thanks Aaron.

Created a ticket https://issues.apache.org/jira/browse/CASSANDRA-4367

Funny thing is, I don't see any of the SSTables that participated in the
failed compaction.

Will do an upgradesstables and find out if the problem still persists.


On Mon, Jun 18, 2012 at 6:43 AM, aaron morton aa...@thelastpickle.com wrote:

 Can you please create a ticket on
 https://issues.apache.org/jira/browse/CASSANDRA

 Please include:
 * CF definition including the bloom_filter_fp_chance
 * If the data was upgraded from a previous version of cassandra.
 * The names of the files that were being compacted.

 As a work-around you can try using nodetool upgradesstables to re-write the
 files - this may also fail, but it could be worth trying.

 The next step would be to determine which files were causing the
 issue (looking at the logs) and remove them from the data directory. Then
 run repair to restore consistency.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 14/06/2012, at 11:38 PM, Ravikumar Govindarajan wrote:

 We received the following NPE during compaction of a large row. We are on
 cassandra-1.0.7. Need some help here to find the root cause of the issue

  ERROR [CompactionExecutor:595] 2012-06-13 09:44:46,718
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[CompactionExecutor:595,1,main]
 java.lang.NullPointerException
 at
 org.apache.cassandra.utils.MurmurHash.hash64(MurmurHash.java:102)
 at
 org.apache.cassandra.utils.BloomFilter.getHashBuckets(BloomFilter.java:103)
 at
 org.apache.cassandra.utils.BloomFilter.getHashBuckets(BloomFilter.java:92)
 at org.apache.cassandra.utils.BloomFilter.add(BloomFilter.java:114)
 at
 org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:96)
 at
 org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:51)
 at
 org.apache.cassandra.db.compaction.PrecompactedRow.write(PrecompactedRow.java:135)
 at
 org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
 at
 org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
 at
 org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:134)
 at
 org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:114)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
 at java.lang.Thread.run(Thread.java:619)

 Thanks and Regards,
 Ravi





Re: migrating from SimpleStrategy to NetworkTopologyStrategy

2012-04-19 Thread Ravikumar Govindarajan
We tried this route previously. We did not run repair at all {our use-cases
don't need repair}, but while adding a secondary data center we were
forced to run repair. It ended up exploding the data.

We finally had to start afresh: we scrapped the cluster and re-imported the
data with NTS. Now, whether we require repair or not, we are running it
regularly!!!

I feel that it should be alright to migrate to NTS, if you run repairs
regularly and keep the cluster healthy.

Regards,
Ravi

On Fri, Apr 20, 2012 at 2:20 AM, aaron morton aa...@thelastpickle.com wrote:

 There is this, it's old..
 http://wiki.apache.org/cassandra/Operations#Replication
 There was also a discussion about it in the last month or so.

 i *think* it's ok so long as you move to a single DC and single rack. But
 please test.

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 20/04/2012, at 5:03 AM, Marcus Both wrote:

 I think that it is enough to do an update on the keyspace, for example
 (cassandra-cli):
 update keyspace KEYSPACE with placement_strategy =
 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options
 = {datacenter1: 1};

 On Thu, 19 Apr 2012 16:18:46 +0100
 simojenki simoje...@gmail.com wrote:

 Hi,


 Is there any documentation on what the procedure for migrating from

 SimpleStrategy to NetworkTopologyStrategy?


 thanks


 Simon




 --
 Marcus Both





Nodetool ring and multiple dc

2012-02-09 Thread Ravikumar Govindarajan
Hi,

I was trying to setup a backup DC from existing DC.

State of the existing DC with SimpleStrategy & rep_factor=1:

./nodetool -h localhost ring
Address    DC   Rack  Status  State   Load       Owns    Token
                                                         85070591730234615865843651857942052864
XXX.YYY    DC1  RAC1  Up      Normal  187.69 MB  50.00%  0
XXX.ZZZ    DC1  RAC1  Up      Normal  187.77 MB  50.00%  85070591730234615865843651857942052864

After adding backup DC with NetworkTopologyStrategy {DC1:1,DC2:1}, the
output is as follows

./nodetool -h localhost ring
Address    DC   Rack  Status  State   Load       Owns    Token
                                                         85070591730234615865843651857942052864
XXX.YYY    DC1  RAC1  Up      Normal  187.69 MB  50.00%  0
AAA.BBB    DC2  RAC1  Up      Normal  374.59 MB  11.99%  20392907958956928593056220689159358496
XXX.ZZZ    DC1  RAC1  Up      Normal  187.77 MB  38.01%  85070591730234615865843651857942052864

As per our app rules, all writes will first go through DC1 and then find
their way to DC2. Since the Owns percentage has drastically changed, does it
mean that the DC1 nodes will become unbalanced for future writes?

We have a very balanced ring in our production with all nodes serving
almost equal volume data as of now in DC1. Will setting up a backup DC2
disturb the balance?

Thanks and Regards,
Ravi


Re: Nodetool ring and multiple dc

2012-02-09 Thread Ravikumar Govindarajan
Thanks David, for the clarification.

I feel it would be better if nodetool ring reported per-DC token-space
ownership, to correctly reflect what Cassandra is doing internally, instead of
global token-space ownership.

- Ravi

On Fri, Feb 10, 2012 at 12:42 PM, David Schairer dschai...@humbaba.net wrote:

 nodetool ring is, IMHO, quite confusing in the case of multiple
 datacenters.  Might be easier to think of it as two rings:

 in your DC1 ring you have two nodes, and since the tokens are balanced,
 assuming your rows are randomly distributed you'll have half the data on
 each, since your replication factor in DC1 is 1.

 In your DC2 'ring' you have one node, and with a replication factor of 1
 in DC2, all data will go on that node.

 So you would expect to have n MB of data on XXX.YYY and XXX.ZZZ and 2n MB
 of data on AAA.BBB, and that's what you have, to a T.  :)

 In other words, even though you injected node AAA.BBB with a token that
 seems to divide the ring into uneven portions, because the DC1 ring is only
 DC1, it is not left unbalanced by the new node. If you added a second node
 to DC2 you would want to give it a token of something like
 106338239662793269832304564822427565952 so that DC2 is also evenly
 balanced.
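
 (For reference, a small sketch of the usual per-DC token recipe for
 RandomPartitioner: space each DC's tokens evenly around the 2**127 ring and
 offset each extra DC slightly so no two nodes share a token. This is
 illustrative only, not something from this thread:)

 # Sketch: evenly spaced RandomPartitioner tokens per DC, with a small
 # per-DC offset so tokens never collide across DCs.
 RING = 2 ** 127

 def tokens_for_dc(node_count, dc_offset):
     return [(i * RING // node_count + dc_offset) % RING for i in range(node_count)]

 print(tokens_for_dc(2, 0))    # DC1: [0, 85070591730234615865843651857942052864]
 print(tokens_for_dc(1, 100))  # DC2: [100] -- any small offset avoids DC1's 0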

 --DRS

 On Feb 9, 2012, at 11:00 PM, Ravikumar Govindarajan wrote:

  Hi,
 
  I was trying to setup a backup DC from existing DC.
 
  State of existing DC with SimpleStrategy  rep_factor=1.
 
  ./nodetool -h localhost ring
   Address    DC   Rack  Status  State   Load       Owns    Token
                                                            85070591730234615865843651857942052864
   XXX.YYY    DC1  RAC1  Up      Normal  187.69 MB  50.00%  0
   XXX.ZZZ    DC1  RAC1  Up      Normal  187.77 MB  50.00%  85070591730234615865843651857942052864
 
  After adding backup DC with NetworkTopologyStrategy {DC1:1,DC2:1}, the
 output is as follows
 
  ./nodetool -h localhost ring
   Address    DC   Rack  Status  State   Load       Owns    Token
                                                            85070591730234615865843651857942052864
   XXX.YYY    DC1  RAC1  Up      Normal  187.69 MB  50.00%  0
   AAA.BBB    DC2  RAC1  Up      Normal  374.59 MB  11.99%  20392907958956928593056220689159358496
   XXX.ZZZ    DC1  RAC1  Up      Normal  187.77 MB  38.01%  85070591730234615865843651857942052864
 
  As per our app rules, all writes will first go through DC1 and then find
 it's way to DC2. Since the Owns percentage has drastically changed, will
 it mean that the DC1 nodes will become unbalanced for future writes?
 
  We have a very balanced ring in our production with all nodes serving
 almost equal volume data as of now in DC1. Will setting up a backup DC2
 disturb the balance?
 
  Thanks and Regards,
  Ravi