Re: Cassandra CLI showing inconsistent results during gets
All inserts are at LOCAL_QUORUM in DC1. I am confused because attempt 1 shows the column, attempt 2 returns not found, and attempt 3 shows it again. These attempts were successive, with no time delay, from the same CLI!!! The data is also definitely not being touched by CUD operations from anywhere else during this time.

-- Ravi

On Friday, June 27, 2014, Chris Lohfink clohf...@blackbirdit.com wrote:

Where was the 09_09 column inserted from? Are you sure whatever did the insert is doing a LOCAL_QUORUM on the same DC the CLI is in? It may return before all the nodes get a response back (i.e. 2 of the 3 in the local DC), and the remaining node may report not having the data. After all the nodes respond, the coordinator checks the digests from all the responses, sees there is an inconsistency, and does a read repair -- which would explain the column showing up in the following queries.

Chris

On Jun 26, 2014, at 10:06 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

I ran the following set of commands via CLI on our servers. There is a data discrepancy that I encountered during gets, as shown below. We are running version 1.2.4 with replication factor 3 (DC1) / 2 (DC2). Reads and writes are at LOCAL_QUORUM.

    create column family TestCF
      with key_validation_class = AsciiType
      AND comparator = 'CompositeType(AsciiType,LongType)'
      AND compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

    [default@Sample] consistencylevel AS LOCAL_QUORUM;
    Consistency level is set to 'LOCAL_QUORUM'.

    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104560008'];
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 8.64 msec(s).

    // Do a full row dump, which shows the above column
    [default@Sample] get TestCF[ascii('17732218001')];
    ...
    => (column=177322104547019_:177322104560001, value=31373733323231303030303034353437303139, timestamp=1397743139121)
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    => (column=177322104560003_:177322104560005, value=31373733323231303030303034353630303033, timestamp=1397743323261)
    => (column=177322104562001_:177322104564003, value=31373733323231303030303034353632303031, timestamp=1397749523707)
    ---
    Returned 4771 results.
    Elapsed time: 518 msec(s).

    // Try again
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104560008'];
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 8.03 msec(s).

    // Here the CLI flipped, showing the value as not found
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104550009'];
    Value was not found
    Elapsed time: 12 msec(s).

    // Query again, and it shows the value as found
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104550009'];
    => (column=177322104550009_:177322104550009, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 23 msec(s).

Is this just a CLI bug, or is something deeper brewing? Our app faced a serious issue in code involving this query. Is it a known issue? Any help is much appreciated.

-- Ravi
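A minimal sketch of the check Chris suggests -- pinning both the write and the read to LOCAL_QUORUM in the same data center -- assuming the python cassandra-driver (used elsewhere in this archive) and a CQL3-accessible table of the same shape; the table and column names here are illustrative, not from the thread:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy
    from cassandra.query import SimpleStatement

    # Route all requests to coordinators in DC1, the same DC the CLI runs in.
    cluster = Cluster(['10.0.0.1'],
                      load_balancing_policy=DCAwareRoundRobinPolicy(local_dc='DC1'))
    session = cluster.connect('Sample')

    # Write and read with the same consistency level. If the writer instead
    # used ONE, or quorum in the other DC, a LOCAL_QUORUM read in DC1 can
    # legally miss the column until read repair fires.
    write = SimpleStatement(
        "INSERT INTO test_cf (key, part1, part2, value) VALUES (%s, %s, %s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    session.execute(write, ('17732218001', '177322104550009_', 177322104560008, 'v'))

    read = SimpleStatement(
        "SELECT part1, part2, value FROM test_cf WHERE key = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    for row in session.execute(read, ('17732218001',)):
        print(row)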
Cassandra CLI showing inconsistent results during gets
I ran the following set of commands via CLI on our servers. There is a data discrepancy that I encountered during gets, as shown below. We are running version 1.2.4 with replication factor 3 (DC1) / 2 (DC2). Reads and writes are at LOCAL_QUORUM.

    create column family TestCF
      with key_validation_class = AsciiType
      AND comparator = 'CompositeType(AsciiType,LongType)'
      AND compression_options = {sstable_compression:SnappyCompressor, chunk_length_kb:64};

    [default@Sample] consistencylevel AS LOCAL_QUORUM;
    Consistency level is set to 'LOCAL_QUORUM'.

    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104560008'];
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 8.64 msec(s).

    // Do a full row dump, which shows the above column
    [default@Sample] get TestCF[ascii('17732218001')];
    ...
    => (column=177322104547019_:177322104560001, value=31373733323231303030303034353437303139, timestamp=1397743139121)
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    => (column=177322104560003_:177322104560005, value=31373733323231303030303034353630303033, timestamp=1397743323261)
    => (column=177322104562001_:177322104564003, value=31373733323231303030303034353632303031, timestamp=1397749523707)
    ---
    Returned 4771 results.
    Elapsed time: 518 msec(s).

    // Try again
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104560008'];
    => (column=177322104550009_:177322104560008, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 8.03 msec(s).

    // Here the CLI flipped, showing the value as not found
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104550009'];
    Value was not found
    Elapsed time: 12 msec(s).

    // Query again, and it shows the value as found
    [default@Sample] get TestCF[ascii('17732218001')]['177322104550009_:177322104550009'];
    => (column=177322104550009_:177322104550009, value=31373733323231303030303034353530303039, timestamp=1397743374931)
    Elapsed time: 23 msec(s).

Is this just a CLI bug, or is something deeper brewing? Our app faced a serious issue in code involving this query. Is it a known issue? Any help is much appreciated.

-- Ravi
Multi-range of composite query possible?
We have the following structure in a composite CF, comprising 2 parts:

    Key=123 -> A:1, A:2, A:3, B:1, B:2, B:3, B:4, C:1, C:2, C:3, ...

Our application provides the following inputs for querying on the first part of the composite column:

    key=123, [(colName=A, range=2), (colName=B, range=3), (colName=C, range=1)]

The output below is desired:

    key=123 ->
      A:1, A:2        [Get first 2 composite cols for prefix 'A']
      B:1, B:2, B:3   [Get first 3 composite cols for prefix 'B']
      C:1             [Get the first composite col for prefix 'C']

I see that this is akin to a range-of-range query via composite columns. Is something like this possible in Cassandra, maybe in the latest versions?

-- Ravi
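There is no single slice that expresses "N columns per prefix", but issuing one slice per prefix returns the same result. A sketch, assuming the CF is readable through CQL3 as a table whose two composite parts are clustering columns (python cassandra-driver; names are illustrative):

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect('ks')

    # (prefix, count) pairs supplied by the application
    wanted = [('A', 2), ('B', 3), ('C', 1)]

    for prefix, count in wanted:
        # One slice per prefix: exact match on part1, first `count` values
        # of part2. The LIMIT is inlined because binding it requires 2.0+.
        rows = session.execute(
            "SELECT part1, part2 FROM composite_cf "
            "WHERE key = %s AND part1 = %s LIMIT " + str(int(count)),
            ('123', prefix))
        for row in rows:
            print(row.part1, row.part2)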
Re: Deleting data using timestamp
Thanks for the links. I wanted to avoid a major compaction somehow. I see many JIRA issues on timestamps related to compaction/reads; quite a few improvements have been proposed.

-- Ravi

On Thu, Oct 10, 2013 at 12:26 AM, Shahab Yunus shahab.yu...@gmail.com wrote:

Ahh, yes, 'compaction'. I blanked out while mentioning repair and cleanup. That is in fact what needs to be done first, and what I meant. Thanks Robert.

Regards,
Shahab

On Wed, Oct 9, 2013 at 1:50 PM, Robert Coli rc...@eventbrite.com wrote:

On Wed, Oct 9, 2013 at 7:35 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

> What is the quick way to delete old data and at the same time make sure reads [don't] churn through all the deleted columns?

Use a database that isn't log structured? But seriously, in 2.0 there's this:

https://issues.apache.org/jira/browse/CASSANDRA-5514

which allows for timestamp hints at query time. And...

https://issues.apache.org/jira/browse/CASSANDRA-5228

which does compaction expiration of entire SSTables based on TTL.

=Rob
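CASSANDRA-5228's whole-SSTable expiry only pays off when the data carries a TTL from the moment it is written, so no major compaction is ever needed to reclaim it. A sketch of such a write, assuming the python cassandra-driver and an illustrative table:

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect('ks')

    # 7-day TTL: once every cell in an SSTable has expired, 2.0+ can drop
    # the whole file without compacting it.
    session.execute(
        "INSERT INTO time_series (key, col, value) VALUES (%s, %s, %s) "
        "USING TTL 604800",
        ('17732218001', 177322104550009, 'v'))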
RangeSliceCommand serialize issue
We have suddenly started receiving RangeSliceCommand serializer errors. We are running version 1.2.4. This does not happen for Names-based commands; we get this error only for Slice-based commands. Any help is greatly appreciated.

    ERROR [Thread-405] 2013-10-10 07:58:13,453 CassandraDaemon.java (line 174) Exception in thread Thread[Thread-405,5,main]
    java.lang.NegativeArraySizeException
        at org.apache.cassandra.dht.Token$TokenSerializer.deserialize(Token.java:97)
        at org.apache.cassandra.dht.AbstractBounds$AbstractBoundsSerializer.deserialize(AbstractBounds.java:172)
        at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:297)
        at org.apache.cassandra.db.RangeSliceCommandSerializer.deserialize(RangeSliceCommand.java:179)
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:94)
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:203)
        at org.apache.cassandra.net.IncomingTcpConnection.handleModernVersion(IncomingTcpConnection.java:135)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:82)
Deleting data using timestamp
We have wide rows accumulated in a Cassandra CF and have now changed our app-side logic. The application now wants only the most recent 7 days of data from this CF.

What is the quick way to delete the old data and at the same time make sure reads don't churn through all the deleted columns?

Let's say I do the following:

    for (each key in CF)
        drop key, with timestamp = (System.currentTimeMillis() - 7 days)

What should I do in my reads to make sure that deleted columns don't get examined? I saw some advice on using the max timestamp per SSTable during reads. Can someone explain whether that will solve my read problem here?

-- Ravi
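The per-key drop sketched above can be expressed in CQL as a row-level delete with an explicit timestamp: cells written before the cutoff are shadowed by the tombstone, newer cells survive. A sketch, assuming the python cassandra-driver, an illustrative table name, and some app-side way of enumerating the row keys:

    import time
    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect('ks')

    # Cassandra cell timestamps are microseconds since the epoch.
    cutoff_micros = int((time.time() - 7 * 24 * 3600) * 1000000)

    for key in all_keys:  # all_keys: hypothetical; however the app lists its row keys
        session.execute(
            "DELETE FROM time_series USING TIMESTAMP %s WHERE key = %s",
            (cutoff_micros, key))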
Composite Column Grouping
I have been faced with a problem of grouping composites on the second part. Let's say my CF contains this:

    TimeSeriesCF
      key: UserID
      composite-col-name: TimeUUID:PKID

Some sample data:

    UserID = XYZ
                Time:PKID
    Col-Name1 = 200:1000
    Col-Name2 = 201:1001
    Col-Name3 = 202:1000
    Col-Name4 = 203:1000
    Col-Name5 = 204:1002

Whenever a time-series query is issued, it should return the following in time-descending order, keeping only the latest column per PKID:

    UserID = XYZ
    Col-Name5 = 204:1002
    Col-Name4 = 203:1000
    Col-Name2 = 201:1001

Is something like this possible in Cassandra? Is there a different way to design for and achieve the same objective?

-- Ravi
Re: Composite Column Grouping
Thanks Michael,

But I cannot sort the rows in memory, as the number of columns will be quite huge. From the python script above:

    select_stmt = "select * from time_series where userid = 'XYZ'"

This would return me many hundreds of thousands of columns. I need to go in time-series order using ranges [pagination queries].

On Wed, Sep 11, 2013 at 7:06 AM, Laing, Michael michael.la...@nytimes.com wrote:

If you have set up the table as described in my previous message, you could run this python snippet to return the desired result:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    import logging
    logging.basicConfig()

    from operator import itemgetter

    import cassandra
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cql_cluster = Cluster()
    cql_session = cql_cluster.connect()
    cql_session.set_keyspace('latest')

    select_stmt = "select * from time_series where userid = 'XYZ'"
    query = SimpleStatement(select_stmt)
    rows = cql_session.execute(query)

    results = []
    for row in rows:
        max_time = max(row.colname.keys())
        results.append((row.userid, row.pkid, max_time, row.colname[max_time]))

    sorted_results = sorted(results, key=itemgetter(2), reverse=True)
    for result in sorted_results:
        print result

    # prints:
    # (u'XYZ', u'1002', u'204', u'Col-Name-5')
    # (u'XYZ', u'1000', u'203', u'Col-Name-4')
    # (u'XYZ', u'1001', u'201', u'Col-Name-2')

On Tue, Sep 10, 2013 at 6:32 PM, Laing, Michael michael.la...@nytimes.com wrote:

You could try this. C* doesn't do it all for you, but it will efficiently get you the right data.

-ml

    -- put this in a file and run using 'cqlsh -f file'

    DROP KEYSPACE latest;

    CREATE KEYSPACE latest WITH replication = {
        'class': 'SimpleStrategy',
        'replication_factor': 1
    };

    USE latest;

    CREATE TABLE time_series (
        userid text,
        pkid text,
        colname map<text, text>,
        PRIMARY KEY (userid, pkid)
    );

    UPDATE time_series SET colname = colname + {'200':'Col-Name-1'} WHERE userid = 'XYZ' AND pkid = '1000';
    UPDATE time_series SET colname = colname + {'201':'Col-Name-2'} WHERE userid = 'XYZ' AND pkid = '1001';
    UPDATE time_series SET colname = colname + {'202':'Col-Name-3'} WHERE userid = 'XYZ' AND pkid = '1000';
    UPDATE time_series SET colname = colname + {'203':'Col-Name-4'} WHERE userid = 'XYZ' AND pkid = '1000';
    UPDATE time_series SET colname = colname + {'204':'Col-Name-5'} WHERE userid = 'XYZ' AND pkid = '1002';

    SELECT * FROM time_series WHERE userid = 'XYZ';

    -- returns:
    -- userid | pkid | colname
    -- -------+------+-----------------------------------------------------------------
    --    XYZ | 1000 | {'200': 'Col-Name-1', '202': 'Col-Name-3', '203': 'Col-Name-4'}
    --    XYZ | 1001 | {'201': 'Col-Name-2'}
    --    XYZ | 1002 | {'204': 'Col-Name-5'}

    -- use an app to pop off the latest key/value from the map for each row, then sort by key desc.

On Tue, Sep 10, 2013 at 9:21 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

I have been faced with a problem of grouping composites on the second part. Let's say my CF contains this:

    TimeSeriesCF
      key: UserID
      composite-col-name: TimeUUID:PKID

Some sample data:

    UserID = XYZ
                Time:PKID
    Col-Name1 = 200:1000
    Col-Name2 = 201:1001
    Col-Name3 = 202:1000
    Col-Name4 = 203:1000
    Col-Name5 = 204:1002

Whenever a time-series query is issued, it should return the following in time-descending order:

    UserID = XYZ
    Col-Name5 = 204:1002
    Col-Name4 = 203:1000
    Col-Name2 = 201:1001

Is something like this possible in Cassandra? Is there a different way to design for and achieve the same objective?

-- Ravi
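One common workaround for the pagination requirement, not shown in the thread, is to maintain a second table keyed for the read: one row per pkid, clustered by time descending, updated with a read-delete-insert on every write. A sketch with illustrative names (python cassandra-driver; note the read-before-write is racy under concurrent updates to the same pkid):

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect('latest')

    session.execute("""
        CREATE TABLE IF NOT EXISTS latest_by_time (
            userid text, t text, pkid text, colname text,
            PRIMARY KEY (userid, t)
        ) WITH CLUSTERING ORDER BY (t DESC)""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS latest_time_of_pkid (
            userid text, pkid text, t text,
            PRIMARY KEY (userid, pkid))""")

    def record(userid, t, pkid, colname):
        # Drop the pkid's previous entry so only its latest time remains,
        # then write the new entry and remember its time.
        rows = list(session.execute(
            "SELECT t FROM latest_time_of_pkid WHERE userid=%s AND pkid=%s",
            (userid, pkid)))
        if rows:
            session.execute("DELETE FROM latest_by_time WHERE userid=%s AND t=%s",
                            (userid, rows[0].t))
        session.execute("INSERT INTO latest_by_time (userid, t, pkid, colname) "
                        "VALUES (%s, %s, %s, %s)", (userid, t, pkid, colname))
        session.execute("INSERT INTO latest_time_of_pkid (userid, pkid, t) "
                        "VALUES (%s, %s, %s)", (userid, pkid, t))

A paged, time-descending read is then a plain clustering-column slice: SELECT * FROM latest_by_time WHERE userid = 'XYZ' LIMIT 50, continuing with AND t < [last t seen].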
Re: Key-Token mapping in cassandra
I think I have simplified my example a little too much. Let's assume that there are groups and users. Ideally a grpId becomes the key, and it holds some meta-data. Let's say:

    GroupMetaCF
      grpId -> key, entityId -> col-name, blobdata -> col-value

Now we have a UserTimeSeriesCF:

    grpId/userId -> key, UUID -> col-name, entityId -> col-value
    [Each user will view a subset of the grp data, based on roles etc...]

There are many more such CFs, all with keys prefixed by grpId. By hashing only the grpId part of the key to Cassandra's token, I thought we could co-locate all of a group's data onto one set of replica nodes. Is there a way to achieve this?

-- Ravi

On Thu, Apr 18, 2013 at 1:26 PM, aaron morton aa...@thelastpickle.com wrote:

All rows with the same key go on the same nodes. So if you use the same row key in different CFs, they will be on the same nodes. I.e. have CFs called Image, Documents, Meta and store rows in all of them with the 123 key.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 18/04/2013, at 1:32 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

Thanks Aaron. We are looking at co-locating all keys for a given user in one Cassandra node. Are there any other ways to achieve this?

-- Ravi

On Thursday, April 18, 2013, aaron morton wrote:

> CASSANDRA-1034
That ticket is about removing an assumption which was not correct.

> I would like all keys with 123 as prefix to be mapped to a single token.
Why? It's not possible nor desirable IMHO. Tokens are used to identify a single row internally.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 17/04/2013, at 11:25 PM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

We would like to map multiple keys to a single token in Cassandra. I believe this should be possible now with CASSANDRA-1034.

Ex:
    Key1 -> 123/IMAGE
    Key2 -> 123/DOCUMENTS
    Key3 -> 123/MULTIMEDIA

I would like all keys with 123 as prefix to be mapped to a single token. Is this possible? What Partitioner should I most likely extend to write my own and achieve the desired result?

-- Ravi
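In CQL3 terms, the co-location Aaron describes falls out of the partition key: every row with the same partition key lands on the same replicas, with no custom partitioner needed. A hedged sketch of the UserTimeSeriesCF above, with grpId as the partition key and userId pushed into the clustering columns (names illustrative):

    -- All of a group's time-series entries, for every user in the group,
    -- live in one partition and therefore on one set of replica nodes.
    CREATE TABLE user_time_series (
        grpid    text,
        userid   text,
        t        timeuuid,
        entityid text,
        PRIMARY KEY (grpid, userid, t)
    );

The trade-off is partition size: a very active group concentrates all of its data, and its load, on the same replicas.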
Key-Token mapping in cassandra
We would like to map multiple keys to a single token in Cassandra. I believe this should be possible now with CASSANDRA-1034.

Ex:
    Key1 -> 123/IMAGE
    Key2 -> 123/DOCUMENTS
    Key3 -> 123/MULTIMEDIA

I would like all keys with 123 as prefix to be mapped to a single token. Is this possible? What Partitioner should I most likely extend to write my own and achieve the desired result?

-- Ravi
Digest Query Seems to be corrupt on certain cases
We started receiving OOMs in our Cassandra grid and took a heap dump. We are running version 1.0.7 with LOCAL_QUORUM for both reads and writes.

After some analysis, we have more or less identified the problem, in SliceByNamesReadCommand involving a single super column. It seems to happen only for digest queries and not during actual reads. Below is the layout of the serialized byte array of a SliceByNamesReadCommand, which appears to be corrupt when certain digest queries are issued:

    // Type is SliceByNamesReadCommand
    body[0] = (byte) 1;
    // This is a digest query here
    body[1] = (byte) 1;
    // Table name from bytes 2-8
    // Key name from bytes 9-18
    // QueryPath deserialization here
    // CF name from bytes 19-30
    // Super-column name from byte 31 onwards, but it is corrupt as found in the heap dump:
    // body[32-37] = 0, body[38] = 1, body[39] = 0. This causes the SliceByNames
    // deserializer to mark both ColName=NULL and SuperColName=NULL, fetching the
    // entire wide row!!!
    // The actual super-column name starts only from byte 40, whereas it should
    // have started from byte 31 itself.

Has someone already encountered such an issue? Why is the super-column name not correctly deserialized during a digest query?

-- Ravi
Re: Digest Query Seems to be corrupt on certain cases
VM settings are:

    -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
    -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly

The error stack contained 2 threads for the same key, stalling on a digest query.

The bytes I referred to are the actual value of the _body variable in an org.apache.cassandra.net.Message object, taken from the heap dump. As I understand from the code, ReadVerbHandler will deserialize this _body variable into a SliceByNamesReadCommand object. When I manually inspected this byte array, it held all details correctly except the super-column name, causing it to fetch the entire wide row.

-- Ravi

On Thu, Mar 28, 2013 at 8:36 AM, aaron morton aa...@thelastpickle.com wrote:

> We started receiving OOMs in our cassandra grid and took a heap dump
What are the JVM settings? What was the error stack?

> I am pasting the serialized byte array of SliceByNamesReadCommand, which seems to be corrupt on issuing certain digest queries.
Sorry, I don't follow what you are saying here. Can you enable DEBUG logging and identify the behaviour you think is incorrect?

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 28/03/2013, at 4:15 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

We started receiving OOMs in our Cassandra grid and took a heap dump. We are running version 1.0.7 with LOCAL_QUORUM for both reads and writes.

After some analysis, we have more or less identified the problem, in SliceByNamesReadCommand involving a single super column. It seems to happen only for digest queries and not during actual reads. Below is the layout of the serialized byte array of a SliceByNamesReadCommand, which appears to be corrupt when certain digest queries are issued:

    // Type is SliceByNamesReadCommand
    body[0] = (byte) 1;
    // This is a digest query here
    body[1] = (byte) 1;
    // Table name from bytes 2-8
    // Key name from bytes 9-18
    // QueryPath deserialization here
    // CF name from bytes 19-30
    // Super-column name from byte 31 onwards, but it is corrupt as found in the heap dump:
    // body[32-37] = 0, body[38] = 1, body[39] = 0. This causes the SliceByNames
    // deserializer to mark both ColName=NULL and SuperColName=NULL, fetching the
    // entire wide row!!!
    // The actual super-column name starts only from byte 40, whereas it should
    // have started from byte 31 itself.

Has someone already encountered such an issue? Why is the super-column name not correctly deserialized during a digest query?

-- Ravi
Re: Offsets and Range Queries
Thanks Ed, for the clarifications.

Yes, you are correct that the apps have to handle repeatable reads themselves, and not the databases, when using absolute offsets -- but SQL databases do provide such an option, at the app's peril!!!

> Slices have a fixed size, this ensures that the query does not execute for arbitrary lengths of time.

I assume this is because of the read-time iterators, which go over the results merging/reducing/collating them one by one -- not well suited to jumping to arbitrary offsets, given the practically huge number of columns involved, right? Did I understand it correctly?

We are now faced with persisting the page with both the first and last key for prev/next navigation. The problem quickly gets complex when we have to support multiple pages per user. I just wanted to know if there are any known work-arounds for this.

-- Ravi

On Thu, Nov 15, 2012 at 9:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

There are several reasons. First, there is no absolute offset: the rows are sorted by the data. If someone inserts new data between your query and this query, the rows have changed. Unless you are doing select queries inside a transaction with repeatable read, and your database supports this, the query you mention does not really have absolute offsets either -- the results of the query can change between reads.

In Cassandra we do not execute large queries (that might result in temp tables or whatever) and allow you to page them. Slices have a fixed size; this ensures that the query does not execute for arbitrary lengths of time.

On Thu, Nov 15, 2012 at 6:39 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote:

Usually we do a SELECT * FROM ... ORDER BY ... LIMIT 26,25 for pagination purposes, but specifying an offset is not available for range queries in Cassandra; I always have to specify a start key to achieve this. Are there reasons for choosing such an approach rather than providing an absolute offset?

-- Ravi
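The usual workaround is cursor-style paging: fetch one column more than the page size, render the page, and use the extra column as the start of the next page; for prev navigation, keep the first column of each page visited. A sketch with illustrative names, assuming the python cassandra-driver against a CQL3 table whose wide-row columns are a clustering column:

    from cassandra.cluster import Cluster

    session = Cluster(['10.0.0.1']).connect('ks')
    PAGE = 25

    def fetch_page(key, start_col=None):
        """Return (columns to render, cursor for the next page or None)."""
        if start_col is None:
            q = ("SELECT col, value FROM wide_row WHERE key = %s "
                 "LIMIT " + str(PAGE + 1))
            args = (key,)
        else:
            q = ("SELECT col, value FROM wide_row WHERE key = %s "
                 "AND col >= %s LIMIT " + str(PAGE + 1))
            args = (key, start_col)
        rows = list(session.execute(q, args))
        # The (PAGE+1)-th column, if present, becomes the next page's start.
        next_cursor = rows[PAGE].col if len(rows) > PAGE else None
        return rows[:PAGE], next_cursor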
Re: Composite Column Types Storage
As I understand from the link below, burning column index info onto the SSTable index files will not only eliminate SSTables from the read path but also reduce disk seeks from 3 to 2 for wide rows. Our index files are always mmapped, so there would be only one random seek for a named-column query. I think that is a wonderful improvement.

Shouldn't we be wary of a spike in heap usage from promoting column indexes to the index file? It would be nice to have, say, every 128th entry written out to disk while loading only every 512th index entry into memory during start-up, as a balancing factor.

-- Ravi

On Tue, Sep 18, 2012 at 4:47 PM, Sylvain Lebresne sylv...@datastax.com wrote:

> Range queries do not use bloom filters. It holds good for composite-columns also right?

Since I assume you are referring to the column bloom filters (the key bloom filters are always used), then yes, that holds good for composite columns. Currently, composite column names are completely opaque to the storage engine.

> Column-part-1 alone could have gone into the bloom-filter, speeding up my queries really effectively

True, though https://issues.apache.org/jira/browse/CASSANDRA-2319 (in 1.2 only, however) should help quite a lot here. Basically it will allow skipping sstables based on the column index. Granted, this is less fine-grained than a bloom filter (though on the other side there are no false positives), but I suspect that in most real-life workloads it won't be much worse.

-- Sylvain
Re: Composite Column Types Storage
Thanks for the clarification.

Even though compression solves the disk-space issue, we might still have Memtable bloat, right?

There is another issue for us to handle. The queries are always going to be range queries, with an absolute match on part 1 and a range on part 2 of the composite columns.

    Ex: Query <some-key> <Column-part-1> <Start-Id-part-2> <Limit>

Range queries do not use bloom filters. That holds good for composite columns also, right? I believe I will end up writing bloom-filter bytes only to skip them later. If sharing had been possible, then Column-part-1 alone could have gone into the bloom filter, speeding up my queries really effectively. But as I understand it, there are many levels of nesting possible in a composite type, and special-casing at every level is a big task. Maybe special-casing the top level, or the first part, would be a good start?

-- Ravi

On Wed, Sep 12, 2012 at 5:46 PM, Sylvain Lebresne sylv...@datastax.com wrote:

> Is every string/id combination stored separately in disk

Yes, each combination is stored separately on disk (the storage engine itself doesn't have special casing for composite columns, at least not yet). But as far as disk space is concerned, I suspect that sstable compression makes this largely a non-issue.

-- Sylvain
Re: GC freeze just after repair session
Our young gen size = 800 MB, SurvivorRatio = 8, eden size = 640 MB. All objects/bytes generated during compaction are garbage, right?

During compaction, with in_memory_compaction_limit=64MB and concurrent_compactors=8, there is a lot of pressure on ParNew sweeps. I was thinking of decreasing concurrent_compactors and in_memory_compaction_limit to go easy on GC.

I am not familiar with the inner workings of Cassandra, but I hope I have diagnosed the problem to some extent.

On Fri, Jul 6, 2012 at 11:27 AM, rohit bhatia rohit2...@gmail.com wrote:

@ravi, you can increase the young gen size, keep a high tenuring rate, or increase the survivor ratio.

On Fri, Jul 6, 2012 at 4:03 AM, aaron morton aa...@thelastpickle.com wrote:

> Ideally we would like to collect maximum garbage from ParNew itself, during compactions. What are the steps to take towards achieving this?
I'm not sure what you are asking.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/07/2012, at 6:56 PM, Ravikumar Govindarajan wrote:

We have modified maxTenuringThreshold from 1 to 5. Maybe it is causing problems; will change it back to 1 and see how the system is.

concurrent_compactors=8. We will reduce this, as our system won't be able to handle this number of compactions at the same time anyway. Think it will ease GC to some extent as well.

Ideally we would like to collect maximum garbage from ParNew itself, during compactions. What are the steps to take towards achieving this?

On Wed, Jul 4, 2012 at 4:07 PM, aaron morton aa...@thelastpickle.com wrote:

It *may* have been compaction from the repair, but it's not a big CF. I would look at the logs to see how much data was transferred to the node.

Was there a compaction going on while the GC storm was happening? Do you have a lot of secondary indexes?

If you think it is correlated to compaction, you can try reducing concurrent_compactors.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

Recently, we faced a severe freeze [around 30-40 mins] on one of our servers. There were many mutations/reads dropped. The issue happened just after a routine nodetool repair for the below CF completed [1.0.7, NTS, DC1:3, DC2:2].

    Column Family: MsgIrtConv
    SSTable count: 12
    Space used (live): 17426379140
    Space used (total): 17426379140
    Number of Keys (estimate): 122624
    Memtable Columns Count: 31180
    Memtable Data Size: 81950175
    Memtable Switch Count: 31
    Read Count: 8074156
    Read Latency: 15.743 ms.
    Write Count: 2172404
    Write Latency: 0.037 ms.
    Pending Tasks: 0
    Bloom Filter False Postives: 1258
    Bloom Filter False Ratio: 0.03598
    Bloom Filter Space Used: 498672
    Key cache capacity: 20
    Key cache size: 20
    Key cache hit rate: 0.9965579513062582
    Row cache: disabled
    Compacted row minimum size: 51
    Compacted row maximum size: 89970660
    Compacted row mean size: 226626

Our heap config is as follows:

    -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly

From the yaml:

    in_memory_compaction_limit=64
    compaction_throughput_mb_sec=8
    multi_threaded_compaction=false

    INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is fully synced
    INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
    INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,]. 47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s. Time: 6,186ms.

After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran every 3 seconds, while CMS ran approximately every 30 seconds, continuously for 40 minutes.

    INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512
    ...
    INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
    INFO [ScheduledTasks...
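For reference, the eden figure quoted above follows directly from the JVM flags: the young gen is eden plus two survivor spaces, and SurvivorRatio is the eden:survivor ratio, so

    eden = Xmn * SurvivorRatio / (SurvivorRatio + 2)
         = 800 MB * 8 / 10
         = 640 MB    (each survivor space: 80 MB)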
Re: GC freeze just after repair session
We have modified maxTenuringThreshold from 1 to 5. Maybe it is causing problems; will change it back to 1 and see how the system is.

concurrent_compactors=8. We will reduce this, as our system won't be able to handle this number of compactions at the same time anyway. Think it will ease GC to some extent as well.

Ideally we would like to collect maximum garbage from ParNew itself, during compactions. What are the steps to take towards achieving this?

On Wed, Jul 4, 2012 at 4:07 PM, aaron morton aa...@thelastpickle.com wrote:

It *may* have been compaction from the repair, but it's not a big CF. I would look at the logs to see how much data was transferred to the node.

Was there a compaction going on while the GC storm was happening? Do you have a lot of secondary indexes?

If you think it is correlated to compaction, you can try reducing concurrent_compactors.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/07/2012, at 6:33 PM, Ravikumar Govindarajan wrote:

Recently, we faced a severe freeze [around 30-40 mins] on one of our servers. There were many mutations/reads dropped. The issue happened just after a routine nodetool repair for the below CF completed [1.0.7, NTS, DC1:3, DC2:2].

    Column Family: MsgIrtConv
    SSTable count: 12
    Space used (live): 17426379140
    Space used (total): 17426379140
    Number of Keys (estimate): 122624
    Memtable Columns Count: 31180
    Memtable Data Size: 81950175
    Memtable Switch Count: 31
    Read Count: 8074156
    Read Latency: 15.743 ms.
    Write Count: 2172404
    Write Latency: 0.037 ms.
    Pending Tasks: 0
    Bloom Filter False Postives: 1258
    Bloom Filter False Ratio: 0.03598
    Bloom Filter Space Used: 498672
    Key cache capacity: 20
    Key cache size: 20
    Key cache hit rate: 0.9965579513062582
    Row cache: disabled
    Compacted row minimum size: 51
    Compacted row maximum size: 89970660
    Compacted row mean size: 226626

Our heap config is as follows:

    -Xms8G -Xmx8G -Xmn800M -XX:+HeapDumpOnOutOfMemoryError -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
    -XX:MaxTenuringThreshold=5 -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly

From the yaml:

    in_memory_compaction_limit=64
    compaction_throughput_mb_sec=8
    multi_threaded_compaction=false

    INFO [AntiEntropyStage:1] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 762) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] MsgIrtConv is fully synced
    INFO [AntiEntropySessions:8] 2012-06-29 09:21:26,085 AntiEntropyService.java (line 698) [repair #2b6fcbf0-c1f9-11e1--2ea8811bfbff] session completed successfully
    INFO [CompactionExecutor:857] 2012-06-29 09:21:31,219 CompactionTask.java (line 221) Compacted to [/home/sas/system/data/ZMail/MsgIrtConv-hc-858-Data.db,]. 47,907,012 to 40,554,059 (~84% of original) bytes for 4,564 keys at 6.252080MB/s. Time: 6,186ms.

After this, the logs were fully filled with GC [ParNew/CMS]. ParNew ran every 3 seconds, while CMS ran approximately every 30 seconds, continuously for 40 minutes.

    INFO [ScheduledTasks:1] 2012-06-29 09:23:39,921 GCInspector.java (line 122) GC for ParNew: 776 ms for 2 collections, 2901990208 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 09:23:42,265 GCInspector.java (line 122) GC for ParNew: 2028 ms for 2 collections, 3831282056 used; max is 8506048512
    ...
    INFO [ScheduledTasks:1] 2012-06-29 10:07:53,884 GCInspector.java (line 122) GC for ParNew: 817 ms for 2 collections, 2808685768 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:07:55,632 GCInspector.java (line 122) GC for ParNew: 1165 ms for 3 collections, 3264696776 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:07:57,773 GCInspector.java (line 122) GC for ParNew: 1444 ms for 3 collections, 4234372296 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:07:59,387 GCInspector.java (line 122) GC for ParNew: 1153 ms for 2 collections, 4910279080 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:08:00,389 GCInspector.java (line 122) GC for ParNew: 697 ms for 2 collections, 4873857072 used; max is 8506048512
    INFO [ScheduledTasks:1] 2012-06-29 10:08:01,443 GCInspector.java (line 122) GC for ParNew: 726 ms for 2 collections, 4941511184 used; max is 8506048512

After this, the node got stable and was back up and running. Any pointers will be greatly helpful.
Re: MurmurHash NPE during compaction
Thanks Aaron. Created a ticket: https://issues.apache.org/jira/browse/CASSANDRA-4367

Funny thing is, I don't see any of the SSTables that participated in the failed compaction. Will do an upgradesstables and find out whether the problem still persists.

On Mon, Jun 18, 2012 at 6:43 AM, aaron morton aa...@thelastpickle.com wrote:

Can you please create a ticket on https://issues.apache.org/jira/browse/CASSANDRA

Please include:
* The CF definition, including the bloom_filter_fp_chance
* Whether the data was upgraded from a previous version of Cassandra
* The names of the files that were being compacted

As a workaround you can try using nodetool upgradesstables to re-write the files -- this may also fail, but it could be worth trying. The next step would be to determine which files were causing the issue (looking at the logs) and remove them from the data directory, then run repair to restore consistency.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 14/06/2012, at 11:38 PM, Ravikumar Govindarajan wrote:

We received the following NPE during compaction of a large row. We are on cassandra-1.0.7. Need some help here to find the root cause of the issue.

    ERROR [CompactionExecutor:595] 2012-06-13 09:44:46,718 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[CompactionExecutor:595,1,main]
    java.lang.NullPointerException
        at org.apache.cassandra.utils.MurmurHash.hash64(MurmurHash.java:102)
        at org.apache.cassandra.utils.BloomFilter.getHashBuckets(BloomFilter.java:103)
        at org.apache.cassandra.utils.BloomFilter.getHashBuckets(BloomFilter.java:92)
        at org.apache.cassandra.utils.BloomFilter.add(BloomFilter.java:114)
        at org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:96)
        at org.apache.cassandra.db.ColumnIndexer.serialize(ColumnIndexer.java:51)
        at org.apache.cassandra.db.compaction.PrecompactedRow.write(PrecompactedRow.java:135)
        at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:160)
        at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:159)
        at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:134)
        at org.apache.cassandra.db.compaction.CompactionManager$1.call(CompactionManager.java:114)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:619)

Thanks and Regards,
Ravi
Re: migrating from SimpleStrategy to NetworkTopologyStrategy
We tried this route previously. We did not run repair at all {our use-cases don't need a repair}, but while adding a secondary data center we were forced to run repair, and it ended up exploding the data. We finally had to start afresh: scrapped the cluster and re-imported the data with NTS. Now, whether we require repair or not, we are running it regularly!!!

I feel that it should be alright to migrate to NTS if you run repairs regularly and keep the cluster healthy.

Regards,
Ravi

On Fri, Apr 20, 2012 at 2:20 AM, aaron morton aa...@thelastpickle.com wrote:

There is this, it's old...
http://wiki.apache.org/cassandra/Operations#Replication

There was also a discussion about it in the last month or so. I *think* it's OK so long as you move to a single DC and single rack. But please test.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/04/2012, at 5:03 AM, Marcus Both wrote:

I think it is enough to do an update on the keyspace, for example (cassandra-cli):

    update keyspace KEYSPACE with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = {datacenter1: 1};

On Thu, 19 Apr 2012 16:18:46 +0100, simojenki simoje...@gmail.com wrote:

Hi, is there any documentation on the procedure for migrating from SimpleStrategy to NetworkTopologyStrategy?

thanks

Simon

-- Marcus Both
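Putting the thread's advice together, the migration is a strategy change followed by repairs; a sketch (keyspace name and replica counts are illustrative -- the strategy_options keys must match the data-center names your snitch reports):

    # cassandra-cli: switch the keyspace to NetworkTopologyStrategy
    update keyspace MyKS
      with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
      and strategy_options = {DC1: 3, DC2: 2};

    # then, on every node, rebuild any replicas the new strategy moved
    nodetool -h <node_host> repair MyKS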
Nodetool ring and multiple dc
Hi,

I was trying to set up a backup DC from an existing DC. State of the existing DC with SimpleStrategy, rep_factor=1:

    ./nodetool -h localhost ring
    Address   DC    Rack   Status  State    Load       Owns     Token
                                                                85070591730234615865843651857942052864
    XXX.YYY   DC1   RAC1   Up      Normal   187.69 MB  50.00%   0
    XXX.ZZZ   DC1   RAC1   Up      Normal   187.77 MB  50.00%   85070591730234615865843651857942052864

After adding the backup DC with NetworkTopologyStrategy {DC1:1, DC2:1}, the output is as follows:

    ./nodetool -h localhost ring
    Address   DC    Rack   Status  State    Load       Owns     Token
                                                                85070591730234615865843651857942052864
    XXX.YYY   DC1   RAC1   Up      Normal   187.69 MB  50.00%   0
    AAA.BBB   DC2   RAC1   Up      Normal   374.59 MB  11.99%   20392907958956928593056220689159358496
    XXX.ZZZ   DC1   RAC1   Up      Normal   187.77 MB  38.01%   85070591730234615865843651857942052864

As per our app rules, all writes will first go through DC1 and then find their way to DC2. Since the Owns percentage has drastically changed, does it mean that the DC1 nodes will become unbalanced for future writes? We have a very balanced ring in production, with all nodes serving almost equal volumes of data in DC1 as of now. Will setting up a backup DC2 disturb the balance?

Thanks and Regards,
Ravi
Re: Nodetool ring and multiple dc
Thanks David, for the clarification.

I feel it would be better if nodetool ring reported per-DC token-space ownership, to correctly reflect what Cassandra is doing internally, instead of global token-space ownership.

- Ravi

On Fri, Feb 10, 2012 at 12:42 PM, David Schairer dschai...@humbaba.net wrote:

nodetool ring is, IMHO, quite confusing in the case of multiple datacenters. It might be easier to think of it as two rings: in your DC1 ring you have two nodes, and since the tokens are balanced, assuming your rows are randomly distributed, you'll have half the data on each, since your replication factor in DC1 is 1. In your DC2 'ring' you have one node, and with a replication factor of 1 in DC2, all data will go on that node.

So you would expect to have n MB of data on XXX.YYY and XXX.ZZZ and 2n MB of data on AAA.BBB, and that's what you have, to a T. :)

In other words, despite the fact that you injected node AAA.BBB with a token that seems to divide the ring into uneven portions, the DC1 ring is only DC1, so it's not left unbalanced by the new node. If you added a second node to DC2, you would want to give it a token of something like 106338239662793269832304564822427565952 so that DC2 is also evenly balanced.

--DRS

On Feb 9, 2012, at 11:00 PM, Ravikumar Govindarajan wrote:

Hi,

I was trying to set up a backup DC from an existing DC. State of the existing DC with SimpleStrategy, rep_factor=1:

    ./nodetool -h localhost ring
    Address   DC    Rack   Status  State    Load       Owns     Token
                                                                85070591730234615865843651857942052864
    XXX.YYY   DC1   RAC1   Up      Normal   187.69 MB  50.00%   0
    XXX.ZZZ   DC1   RAC1   Up      Normal   187.77 MB  50.00%   85070591730234615865843651857942052864

After adding the backup DC with NetworkTopologyStrategy {DC1:1, DC2:1}, the output is as follows:

    ./nodetool -h localhost ring
    Address   DC    Rack   Status  State    Load       Owns     Token
                                                                85070591730234615865843651857942052864
    XXX.YYY   DC1   RAC1   Up      Normal   187.69 MB  50.00%   0
    AAA.BBB   DC2   RAC1   Up      Normal   374.59 MB  11.99%   20392907958956928593056220689159358496
    XXX.ZZZ   DC1   RAC1   Up      Normal   187.77 MB  38.01%   85070591730234615865843651857942052864

As per our app rules, all writes will first go through DC1 and then find their way to DC2. Since the Owns percentage has drastically changed, does it mean that the DC1 nodes will become unbalanced for future writes? We have a very balanced ring in production, with all nodes serving almost equal volumes of data in DC1 as of now. Will setting up a backup DC2 disturb the balance?

Thanks and Regards,
Ravi
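David's rule of thumb -- balance the tokens within each data center independently, offsetting one DC so that no two nodes share a token -- can be computed with a small script. A sketch for RandomPartitioner (the offset value is an arbitrary choice):

    RING = 2 ** 127  # RandomPartitioner token space

    def balanced_tokens(node_count, offset=0):
        """Evenly spaced tokens for one DC, shifted by a per-DC offset."""
        return [(offset + i * RING // node_count) % RING
                for i in range(node_count)]

    print(balanced_tokens(2))            # DC1: [0, 85070591730234615865843651857942052864]
    print(balanced_tokens(2, offset=1))  # DC2: same spacing, shifted by 1 so tokens stay unique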