commitlog replay missing data
Hey all, Recently upgraded to 0.8.1 and noticed what seems to be missing data after a commitlog replay on a single-node cluster. I start the node, insert a bunch of stuff (~600MB), stop it, and restart it. There are log messages pertaining to the commitlog replay and no errors, but some of the data is missing. If I flush before stopping the node, everything is fine, and running cfstats in the two cases shows different amounts of data in the SSTables. Moreover, the amount of data that is missing is nondeterministic. Has anyone run into this? Thanks. Here is a side-by-side diff of cfstats output for a single CF before restarting (left column) and after (right column). Somehow a 37MB memtable became a 2.9MB SSTable (note the difference in write count as well)?

Column Family: Blocks           before restart   after restart
SSTable count:                  0                1
Space used (live):              0                2907637
Space used (total):             0                2907637
Memtable Columns Count:         8198             0
Memtable Data Size:             37550510         0
Memtable Switch Count:          0                1
Read Count:                     0                0
Read Latency:                   NaN ms           NaN ms
Write Count:                    8198             1526
Write Latency:                  0.018 ms         0.011 ms
Pending Tasks:                  0                0
Key cache capacity:             20               20
Key cache size:                 0                0
Key cache hit rate:             NaN              NaN
Row cache:                      disabled         disabled
Compacted row minimum size:     0                1110
Compacted row maximum size:     0                2299
Compacted row mean size:        0                1960

Note that I patched https://issues.apache.org/jira/browse/CASSANDRA-2317 in my version, but there are no deletions involved so I don't think it's relevant unless I messed something up while patching. -Jeffrey
hinted handoff sleeping
Hey all, We're running a slightly patched version of 0.7.3 on a cluster of 5 nodes. I've been noticing a number of messages in our logs which look like this (after a node goes down and comes back up, usually just due to a GC):

2011-06-23 14:46:35,381 INFO [HintedHandoff:1] org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:290) {USER='',IP=''} - Sleeping 32649ms to stagger hint delivery

The interesting thing is that we have hinted_handoff_enabled = false in the YAML configuration, so it always says 0 rows are handed off (later, after the sleep). Thus this sleeping seems quite wasteful. Is this part of the code supposed to be reached even with hinted handoff disabled? Thanks. -Jeffrey
RE: hinted handoff sleeping
No, it's always been off. No hints are ever being delivered, but the HintedHandOffManager still does some work when nodes come back online. -Jeffrey -Original Message- From: Ryan King [mailto:r...@twitter.com] Sent: Thursday, June 23, 2011 3:00 PM To: user@cassandra.apache.org Subject: Re: hinted handoff sleeping On Thu, Jun 23, 2011 at 2:55 PM, Jeffrey Wang jw...@palantir.com wrote: Hey all, We're running a slightly patched version of 0.7.3 on a cluster of 5 nodes. I've been noticing a number of messages in our logs which look like this (after a node goes "down" and comes back up, usually just due to a GC): 2011-06-23 14:46:35,381 INFO [HintedHandoff:1] org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:290) {USER='',IP=''} - Sleeping 32649ms to stagger hint delivery The interesting thing is that we have hinted_handoff_enabled = false in the YAML configuration, so it always says 0 rows are handed off (later, after the sleep). Thus this sleeping seems quite wasteful. Is this part of the code supposed to be reached even with hinted handoff disabled? Thanks. Did you previously run with HH on? That config setting prohibits new hints from being created, but doesn't prevent existing ones from being delivered. -ryan
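For anyone hitting this: the behavior described above suggests an early-out before the stagger sleep when hinted handoff is disabled. A minimal sketch of that guard, using hypothetical simplified names (the real method is HintedHandOffManager.deliverHintsToEndpoint; this is a model of the proposed check, not an actual patch):

```java
import java.util.Random;

public class HintDeliverySketch {
    private final boolean hintedHandoffEnabled;
    private final Random random = new Random();

    public HintDeliverySketch(boolean enabled) { this.hintedHandoffEnabled = enabled; }

    /** Returns the ms slept, or -1 if delivery was skipped entirely. */
    public long deliverHintsToEndpoint(String endpoint) {
        if (!hintedHandoffEnabled)
            return -1; // proposed early-out: no hints can exist, so don't bother staggering
        long sleepMs = random.nextInt(60000); // "Sleeping Nms to stagger hint delivery"
        try { Thread.sleep(sleepMs); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        // ... scan the hints CF and deliver hinted rows to the endpoint ...
        return sleepMs;
    }

    public static void main(String[] args) {
        HintDeliverySketch off = new HintDeliverySketch(false);
        System.out.println(off.deliverHintsToEndpoint("10.0.0.1")); // prints -1: skipped
    }
}
```

As Ryan notes, disabling the setting only stops new hints from being created; an early-out like this would also have to decide what to do with hints that already exist on disk.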
multiple clusters communicating
Hey all, We're seeing a strange issue between two completely separate clusters (0.7.3) on the same subnet (X.X.X.146 through X.X.X.150), one with 3 machines (146-148) and one with 2 machines (149-150). Both are seeded only with machines from their own cluster, yet when we run them they end up gossiping with each other. They have different cluster names so they don't merge, but this is quite annoying as schema changes don't actually go through. Anyone have any ideas about this? Thanks. -Jeffrey
RE: pig + hadoop
Did you set PIG_RPC_PORT in your hadoop-env.sh? I was seeing this error for a while before I added that. -Jeffrey From: pob [mailto:peterob...@gmail.com] Sent: Tuesday, April 19, 2011 6:42 PM To: user@cassandra.apache.org Subject: Re: pig + hadoop Hey Aaron, I read it, and all 3 env variables were exported. The results are the same. Best, P 2011/4/20 aaron morton aa...@thelastpickle.com Am guessing but here goes. Looks like the Cassandra RPC port is not set; did you follow these steps in contrib/pig/README.txt? Finally, set the following as environment variables (uppercase, underscored), or as Hadoop configuration variables (lowercase, dotted): * PIG_RPC_PORT or cassandra.thrift.port : the port thrift is listening on * PIG_INITIAL_ADDRESS or cassandra.thrift.address : initial address to connect to * PIG_PARTITIONER or cassandra.partitioner.class : cluster partitioner Hope that helps. Aaron On 20 Apr 2011, at 11:28, pob wrote: Hello, I did the cluster configuration per http://wiki.apache.org/cassandra/HadoopSupport. When I run pig example-script.pig -x local, everything is fine and I get correct results. The problem occurs with -x mapreduce; I'm getting these errors:

2011-04-20 01:24:21,791 [main] ERROR org.apache.pig.tools.pigstats.PigStats - ERROR: java.lang.NumberFormatException: null
2011-04-20 01:24:21,792 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2011-04-20 01:24:21,793 [main] INFO org.apache.pig.tools.pigstats.PigStats - Script Statistics:
Input(s): Failed to read data from cassandra://Keyspace1/Standard1
Output(s): Failed to produce result in hdfs://ip:54310/tmp/temp-1383865669/tmp-1895601791
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG: job_201104200056_0005 - null, null- null, null
2011-04-20 01:24:21,793 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
2011-04-20 01:24:21,803 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias topnames. Backend error : java.lang.NumberFormatException: null

That's from the jobtasks web management; the error from the task directly:

java.lang.RuntimeException: java.lang.NumberFormatException: null
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:123)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NumberFormatException: null
    at java.lang.Integer.parseInt(Integer.java:417)
    at java.lang.Integer.parseInt(Integer.java:499)
    at org.apache.cassandra.hadoop.ConfigHelper.getRpcPort(ConfigHelper.java:233)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:105)
    ... 5 more

Any suggestions where the problem might be? Thanks,
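For reference, the trace shows ConfigHelper.getRpcPort parsing an unset property, hence the NumberFormatException: null. A sketch of the hadoop-env.sh lines that avoid it (values are illustrative: 9160 is Cassandra's default Thrift port; adjust the address and partitioner class for your cluster):

```shell
# Illustrative hadoop-env.sh entries per contrib/pig/README.txt;
# the host and partitioner below are assumptions, not universal values.
export PIG_RPC_PORT=9160
export PIG_INITIAL_ADDRESS=localhost
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
```

These must be visible to the TaskTrackers (or set as the equivalent lowercase, dotted Hadoop configuration variables), not just in the shell you launch pig from.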
DatabaseDescriptor.defsVersion
Hey all, I've been seeing a very rare issue with schema change conflicts on 0.7.3 (I am serializing all schema changes to a single Cassandra node and waiting for them to finish before continuing). Occasionally a node in the cluster will never report the correct schema, and I think it may have to do with synchronization on DatabaseDescriptor.defsVersion. As far as I can tell, it is a static variable accessed by multiple threads but is not protected by synchronized/volatile. I was able to write a test in which one thread never reads the modification done by another thread (as can happen with an unsynchronized variable). Should this be fixed, or is there a higher-level reason this does not need to be synchronized (in which case I should continue looking for the reason why my schemas don't agree)? Thanks. -Jeffrey
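A minimal sketch of the visibility fix under discussion, using a stand-in class rather than the real DatabaseDescriptor: marking the shared version field volatile guarantees that a reader thread eventually observes a writer's update, which a plain static field does not.

```java
import java.util.UUID;

public class DefsVersionSketch {
    // Without 'volatile', the Java memory model permits another thread to
    // read a stale cached value of this field indefinitely; with it, a
    // write becomes visible to all subsequent reads.
    private static volatile UUID defsVersion = new UUID(0, 0);

    /** Writes a new version from another thread and checks it is visible here. */
    public static boolean updateVisible() {
        final UUID newVersion = UUID.randomUUID();
        Thread writer = new Thread(() -> defsVersion = newVersion);
        writer.start();
        try { writer.join(); } catch (InterruptedException e) { return false; }
        // join() itself establishes happens-before for this check, but
        // 'volatile' is what protects concurrent readers that never join the writer.
        return defsVersion.equals(newVersion);
    }

    public static void main(String[] args) {
        System.out.println(updateVisible()); // prints true
    }
}
```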
RE: DatabaseDescriptor.defsVersion
Done: https://issues.apache.org/jira/browse/CASSANDRA-2490 -Jeffrey -Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Friday, April 15, 2011 7:39 PM To: user@cassandra.apache.org Cc: Jeffrey Wang Subject: Re: DatabaseDescriptor.defsVersion I think you found a bug; it should be volatile. (Cassandra does already make sure that only one change runs internally at a time.) Can you create a ticket? On Fri, Apr 15, 2011 at 6:04 PM, Jeffrey Wang jw...@palantir.com wrote: Hey all, I've been seeing a very rare issue with schema change conflicts on 0.7.3 (I am serializing all schema changes to a single Cassandra node and waiting for them to finish before continuing). Occasionally a node in the cluster will never report the correct schema, and I think it may have to do with synchronization on DatabaseDescriptor.defsVersion. As far as I can tell, it is a static variable accessed by multiple threads but is not protected by synchronized/volatile. I was able to write a test in which one thread never reads the modification done by another thread (as is expected by an unsynchronized variable). Should this be fixed or is there a higher level reason this does not need to be synchronized (in which case I should continue looking for the reason why my schemas don't agree)? Thanks. -Jeffrey -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
RE: pig counting question
I don't think it's Pig running out of memory, but rather Cassandra itself (the data doesn't even make it to Pig). get_range_slices() is called with a row batch size of 4096, the default, and it fetches all of the columns in each row. If I have 10K columns in each row, that's a huge request, and Cassandra runs into memory pressure trying to serve it. That's my understanding of it; if there's something I'm missing, please let me know. -Jeffrey -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Friday, March 25, 2011 11:06 AM To: user@cassandra.apache.org Subject: Re: pig counting question One thing I wonder though - if your columns are the things that are increasing your heap size and eating up a lot of memory, and you're reading the data structure out as a bag of columns, why isn't Pig spilling to disk instead of growing in memory? The Pig model is that you can have huge bags that don't kill you on memory but are just slower because they spill to disk. What is the schema that you impose when you load the data? On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote: It looks like this functionality is not in the 0.7.3 version of CassandraStorage. I tried to add the constructor which takes the limit to the class, but I ran into some Pig parsing errors, so I had to make the parameter a string. How did you get around this for the version of CassandraStorage in trunk? I'm running Pig 0.8.0. Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra starts eating up huge amounts of memory, maxing out my 16GB heap size. I suspect this is because of the get_range_slices() call from ColumnFamilyRecordReader. Are there plans to make this streaming/paged? 
-Jeffrey -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, March 24, 2011 11:34 AM To: user@cassandra.apache.org Subject: Re: pig counting question The limit defaults to 1024 but you can set it when you use CassandraStorage in pig, like so: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096); or whatever value you wish. Give that a try and see if it gives you more of what you're looking for. On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote: Hey all, I'm trying to run a very simple Pig script against my Cassandra cluster (5 nodes, 0.7.3). I've gotten it all set up and working, but the script is giving me some strange results. Here is my script: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(); rowct = FOREACH rows GENERATE $0, COUNT($1); dump rowct; If I understand Pig correctly, this should output (row name, column count) tuples, but I'm always seeing 1024 for the column count even though the rows have highly variable number of columns. Am I missing something? Thanks. -Jeffrey
RE: pig counting question
Just to be clear, it's also the case that if I have a Hadoop TaskTracker running on each node that Cassandra is running on, a map/reduce job will automatically handle data locality, right? I.e. each mapper will only read splits which live on the same box. -Jeffrey -Original Message- From: Jeffrey Wang [mailto:jw...@palantir.com] Sent: Friday, March 25, 2011 11:42 AM To: user@cassandra.apache.org Subject: RE: pig counting question I don't think it's Pig running out of memory, but rather Cassandra itself (the data doesn't even make it to Pig). get_range_slices() is called with a row batch size of 4096, the default, and it's fetching all of the columns in each row. If I have 10K columns in each row, that's a huge request, and Cassandra runs into memory pressure trying to serve it. That's my understanding of it; if there's something I'm missing, please let me know. -Jeffrey -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Friday, March 25, 2011 11:06 AM To: user@cassandra.apache.org Subject: Re: pig counting question One thing I wonder though - if your columns are the thing that are increasing your heap size and eating up a lot of memory, and you're reading the data structure out as a bag of columns, why isn't pig spilling to disk instead of growing in memory. The pig model is that you can have huge bags that don't kill you on memory but they are just slower because they spill to disk. What is the schema that you impose when you load the data? On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote: It looks like this functionality is not in the 0.7.3 version of CassandraStorage. I tried to add the constructor which takes the limit to the class, but I ran into some Pig parsing errors, so I had to make the parameter a string. How did you get around this for the version of CassandraStorage in trunk? I'm running Pig 0.8.0. Also, when I bump the limit up very high (e.g. 
1M columns), my Cassandra starts eating up huge amounts of memory, maxing out my 16GB heap size. I suspect this is because of the get_range_slices() call from ColumnFamilyRecordReader. Are there plans to make this streaming/paged? -Jeffrey -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, March 24, 2011 11:34 AM To: user@cassandra.apache.org Subject: Re: pig counting question The limit defaults to 1024 but you can set it when you use CassandraStorage in pig, like so: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096); or whatever value you wish. Give that a try and see if it gives you more of what you're looking for. On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote: Hey all, I'm trying to run a very simple Pig script against my Cassandra cluster (5 nodes, 0.7.3). I've gotten it all set up and working, but the script is giving me some strange results. Here is my script: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(); rowct = FOREACH rows GENERATE $0, COUNT($1); dump rowct; If I understand Pig correctly, this should output (row name, column count) tuples, but I'm always seeing 1024 for the column count even though the rows have highly variable number of columns. Am I missing something? Thanks. -Jeffrey
RE: pig counting question
It looks like this functionality is not in the 0.7.3 version of CassandraStorage. I tried to add the constructor which takes the limit to the class, but I ran into some Pig parsing errors, so I had to make the parameter a string. How did you get around this for the version of CassandraStorage in trunk? I'm running Pig 0.8.0. Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra starts eating up huge amounts of memory, maxing out my 16GB heap size. I suspect this is because of the get_range_slices() call from ColumnFamilyRecordReader. Are there plans to make this streaming/paged? -Jeffrey -Original Message- From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] Sent: Thursday, March 24, 2011 11:34 AM To: user@cassandra.apache.org Subject: Re: pig counting question The limit defaults to 1024 but you can set it when you use CassandraStorage in pig, like so: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096); or whatever value you wish. Give that a try and see if it gives you more of what you're looking for. On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote: Hey all, I'm trying to run a very simple Pig script against my Cassandra cluster (5 nodes, 0.7.3). I've gotten it all set up and working, but the script is giving me some strange results. Here is my script: rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(); rowct = FOREACH rows GENERATE $0, COUNT($1); dump rowct; If I understand Pig correctly, this should output (row name, column count) tuples, but I'm always seeing 1024 for the column count even though the rows have highly variable number of columns. Am I missing something? Thanks. -Jeffrey
RE: running all unit tests
Awesome, thanks. I'm seeing some weird errors due to deleting commit logs, though (I'm running on Windows, which might have something to do with it):

[junit] java.io.IOException: Failed to delete C:\Documents and Settings\jwang\workspace-cass\Cassandra\Cassandra-0.7.0\build\test\cassandra\commitlog\CommitLog-1300214497376.log
[junit]     at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
[junit]     at org.apache.cassandra.io.util.FileUtils.deleteRecursive(FileUtils.java:201)
[junit]     at org.apache.cassandra.io.util.FileUtils.deleteRecursive(FileUtils.java:197)
[junit]     at org.apache.cassandra.CleanupHelper.cleanup(CleanupHelper.java:55)
[junit]     at org.apache.cassandra.CleanupHelper.cleanupAndLeaveDirs(CleanupHelper.java:41)

Does anyone know how to get these to work? -Jeffrey From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, March 15, 2011 1:26 AM To: user@cassandra.apache.org Subject: Re: running all unit tests There is a test target in the build script. Aaron On 15 Mar 2011, at 17:29, Jeffrey Wang wrote: Hey all, We're applying some patches to our own branch of Cassandra, and we are wondering if there is a good way to run all the unit tests. Just having JUnit run all the test classes seems to result in a lot of errors that are hard to fix, so I'm hoping there's an easy way to do this. Thanks! -Jeffrey
running all unit tests
Hey all, We're applying some patches to our own branch of Cassandra, and we are wondering if there is a good way to run all the unit tests. Just having JUnit run all the test classes seems to result in a lot of errors that are hard to fix, so I'm hoping there's an easy way to do this. Thanks! -Jeffrey
get_range_slices perf
Hey all, I'm trying to get a list of all the rows from a column family using get_range_slices retrieving no actual columns. I expected this operation to be pretty quick, but it seems to take a while (5-node 0.7.0 cluster takes 20 min to page through 60k keys 1000 at a time). It's not completely clear to me from the code, but is there a lot of SSTable reading involved when getting just the row names? And is this the best way to read all of the row names in a CF? Thanks. -Jeffrey
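On paging row keys: the usual pattern is to reuse the last key of each page as the start of the next request and drop the duplicated first element. A self-contained model of that loop, where the in-memory fetchPage stands in for a Thrift get_range_slices call with an empty column predicate, and a TreeSet plays the role of an order-preserving partitioner (RandomPartitioner returns keys in token order, not lexical order, but the loop is the same):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class KeyPager {
    private final TreeSet<String> rows; // stand-in for the cluster's row keys

    public KeyPager(SortedSet<String> rows) { this.rows = new TreeSet<>(rows); }

    /** Simulates get_range_slices(start_key=start, count=pageSize); inclusive of start. */
    List<String> fetchPage(String start, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String k : rows.tailSet(start, true)) {
            if (page.size() == pageSize) break;
            page.add(k);
        }
        return page;
    }

    /** Walks every key by paging; pageSize must be >= 2 to make progress. */
    public List<String> allKeys(int pageSize) {
        List<String> result = new ArrayList<>();
        String start = "";
        while (true) {
            List<String> page = fetchPage(start, pageSize);
            // After the first page, the paging start key is returned again; skip it.
            if (!result.isEmpty() && !page.isEmpty() && page.get(0).equals(start))
                page.remove(0);
            if (page.isEmpty()) break;
            result.addAll(page);
            start = page.get(page.size() - 1);
        }
        return result;
    }
}
```

Note this doesn't answer the perf question by itself: even with no columns requested, the server still has to touch each row's SSTable index entries, so walking keys is not free.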
understanding tombstones
Hey all, I was wondering if this is the expected behavior of deletes (0.7.0). Let's say I have a 1-node cluster with a single CF which has gc_grace_seconds = 0. The following sequence of operations happens (in the given order): insert row X with timestamp T delete row X with timestamp T+1 force flush + compaction insert row X with timestamp T My understanding is that the tombstone created by the delete (and row X) will disappear with the flush + compaction which means the last insertion should show up. My experimentation, however, suggests otherwise (the last insertion does not show up). I believe I have traced this to the fact that the markedForDeleteAt field on the ColumnFamily does not get reset after a compaction (after gc_grace_seconds has passed); is this desirable? I think it introduces an inconsistency in how tombstoned columns work versus tombstoned CFs. Thanks. -Jeffrey
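A toy model of the semantics described above (not Cassandra code): a row-level deletion at time D hides columns with timestamps <= D, and a compaction past gc_grace_seconds should both purge the shadowed columns and reset the deletion marker. Commenting out the reset line reproduces the reported behavior, where a re-insert at the old timestamp stays hidden.

```java
import java.util.HashMap;
import java.util.Map;

public class RowTombstoneModel {
    static final long NO_DELETION = Long.MIN_VALUE;
    private long markedForDeleteAt = NO_DELETION; // row-level tombstone timestamp
    private final Map<String, Long> columns = new HashMap<>(); // column name -> timestamp

    public void insert(String name, long ts) {
        Long existing = columns.get(name);
        if (existing == null || existing < ts) columns.put(name, ts); // last write wins
    }

    public void deleteRow(long ts) { markedForDeleteAt = Math.max(markedForDeleteAt, ts); }

    /** Models compaction after gc_grace_seconds has elapsed. */
    public void purgeTombstones() {
        columns.values().removeIf(ts -> ts <= markedForDeleteAt); // drop shadowed columns
        markedForDeleteAt = NO_DELETION; // the reset the bug report says was missing
    }

    public boolean isLive(String name) {
        Long ts = columns.get(name);
        return ts != null && ts > markedForDeleteAt;
    }
}
```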
RE: understanding tombstones
Yup. https://issues.apache.org/jira/browse/CASSANDRA-2305 -Jeffrey -Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Wednesday, March 09, 2011 6:19 PM To: user@cassandra.apache.org Subject: Re: understanding tombstones On Wed, Mar 9, 2011 at 4:54 PM, Jeffrey Wang jw...@palantir.com wrote: insert row X with timestamp T delete row X with timestamp T+1 force flush + compaction insert row X with timestamp T My understanding is that the tombstone created by the delete (and row X) will disappear with the flush + compaction which means the last insertion should show up. Right. I believe I have traced this to the fact that the markedForDeleteAt field on the ColumnFamily does not get reset after a compaction (after gc_grace_seconds has passed); is this desirable? I think it introduces an inconsistency in how tombstoned columns work versus tombstoned CFs. Thanks. That does sound like a bug. Can you create a ticket? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
when do snapshots go away?
Hi all, When I drop a column family, it creates a snapshot. When does the snapshot go away and free up the disk space? I was able to run nodetool clearsnapshot to get rid of them, but will they go away themselves? (Also, is there a purpose to keeping a snapshot around?) -Jeffrey
RE: memtable_flush_after_mins setting not working
I just noticed this thread. Does this mean that (assuming the same setup of an empty keyspace and CFs added later) if I have a CF that I write to for some time, but not enough to hit the flush limits, it will never get flushed until the server is restarted? I believe this is causing commit logs to not be deleted, which is taking up a ton of disk space (in addition to a bunch of small memtables being stuck in memory). -Jeffrey From: Ching-Cheng Chen [mailto:cc...@evidentsoftware.com] Sent: Thursday, February 17, 2011 8:52 AM To: user@cassandra.apache.org Cc: Jonathan Ellis Subject: Re: memtable_flush_after_mins setting not working https://issues.apache.org/jira/browse/CASSANDRA-2183 Regards, Chen www.evidentsoftware.com On Thu, Feb 17, 2011 at 11:47 AM, Ching-Cheng Chen cc...@evidentsoftware.com wrote: Certainly, I'll open a ticket to track this issue. Regards, Chen www.evidentsoftware.com On Thu, Feb 17, 2011 at 11:42 AM, Jonathan Ellis jbel...@gmail.com wrote: Your analysis sounds correct to me. Can you open a ticket on https://issues.apache.org/jira/browse/CASSANDRA ? On Thu, Feb 17, 2011 at 10:17 AM, Ching-Cheng Chen cc...@evidentsoftware.com wrote: We have observed that the memtable_flush_after_mins setting occasionally does not work. After some testing and code digging, we finally figured out what's going on. The memtable_flush_after_mins setting won't work under a certain condition with the current implementation in Cassandra. In org.apache.cassandra.db.Table, the scheduled flush task is set up by the following code during construction:
int minCheckMs = Integer.MAX_VALUE;
for (ColumnFamilyStore cfs : columnFamilyStores.values())
{
    minCheckMs = Math.min(minCheckMs, cfs.getMemtableFlushAfterMins() * 60 * 1000);
}
Runnable runnable = new Runnable()
{
    public void run()
    {
        for (ColumnFamilyStore cfs : columnFamilyStores.values())
        {
            cfs.forceFlushIfExpired();
        }
    }
};
flushTask = StorageService.scheduledTasks.scheduleWithFixedDelay(runnable, minCheckMs, minCheckMs, TimeUnit.MILLISECONDS);

Now, for our application, we create a keyspace without any columnfamily first, and only add the needed columnfamilies later depending on requests. However, when the keyspace gets created (without any columnfamily), the above code will actually schedule a fixed-delay flush check task with an interval of Integer.MAX_VALUE ms, since there is no columnfamily yet. Later, when you add a columnfamily to this empty keyspace, the initCf() method in Table.java doesn't check whether the scheduled flush check task's interval needs to be updated. To work around this, we'd need to restart Cassandra after a columnfamily is added to the keyspace. I would suggest adding logic in the initCf() method to recreate the scheduled flush check task if needed. Regards, Chen www.evidentsoftware.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com -- www.evidentsoftware.com
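A sketch of the suggested fix, with simplified stand-in names for Table/ColumnFamilyStore: recompute the minimum check interval and reschedule the flush task every time a column family is added, instead of only at construction. With no CFs, minCheckMs stays at Integer.MAX_VALUE, which is exactly the reported bug.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class FlushScheduler {
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    private final List<Integer> flushAfterMins = new ArrayList<>();
    private ScheduledFuture<?> flushTask;

    /** Mirrors the minCheckMs computation above; empty input reproduces the bug. */
    static long minCheckMs(List<Integer> mins) {
        long min = Integer.MAX_VALUE; // with no CFs this stays MAX_VALUE forever
        for (int m : mins) min = Math.min(min, m * 60L * 1000L);
        return min;
    }

    /** The proposed initCf()-time logic: cancel and reschedule with the new minimum. */
    public synchronized void addColumnFamily(int memtableFlushAfterMins) {
        flushAfterMins.add(memtableFlushAfterMins);
        if (flushTask != null) flushTask.cancel(false);
        long period = minCheckMs(flushAfterMins);
        flushTask = executor.scheduleWithFixedDelay(this::checkExpiredMemtables, period, period, TimeUnit.MILLISECONDS);
    }

    private void checkExpiredMemtables() { /* cfs.forceFlushIfExpired() for each CF */ }
}
```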
dropped mutations, UnavailableException, and long GC
Hey all, Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB disk each, collocated in a DC. We're doing bulk imports from each of the nodes with RF = 2 and write consistency ANY (write perf is very important). The behavior we're seeing is this:

- Nodes often see each other as dead even though none of the nodes actually go down. I suspect this may be due to long GCs. It seems like increasing the RPC timeout could help this, but I'm not convinced this is the root of the problem. Note that in this case writes return with the UnavailableException.

- As mentioned, long GCs. We see the ParNew GC doing a lot of smaller collections (a few hundred MB) which are very fast (a few hundred ms), but every once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 min!) to collect upwards of 15GB at once.

- On some nodes, we see a lot of pending MutationStages build up (e.g. 500K), which leads to the messages "Dropped X MUTATION messages in the last 5000ms", presumably meaning that Cassandra has decided to not write one of the replicas of the data. This is not a HUGE deal, but is less than ideal.

- The end result is that a bunch of writes end up failing due to the UnavailableExceptions, so not all of our data is getting into Cassandra.

So my question is: what is the best way to avoid this behavior? Our memtable thresholds are fairly low (256MB) so there should be plenty of heap space to work with. We may experiment with write consistency ONE or ALL to see if the perf hit is not too bad, but I wanted to get some opinions on why this might be happening. Thanks! -Jeffrey
RE: rolling window of data
Thanks for the response, but unfortunately a TTL is not enough for us. We would like to be able to dynamically control the window in case there is an unusually large amount of data or something so we don't run out of disk space. One question I have in particular is: if I use the timestamp of my log entries (not necessarily correlated at all with the timestamp of insert) as the timestamp on my mutations, will Cassandra do the right thing when I delete? We don't have any need for conflict resolution, so we are currently just using the current time. It seems like there is a possibility, depending on the implementation details of Cassandra, that I could call a remove with a timestamp for which everything before that should get deleted. Like I said before, this seems a bit hacky to me, but would it get the job done? -Jeffrey -Original Message- From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller Sent: Thursday, February 03, 2011 8:48 AM To: user@cassandra.apache.org Subject: Re: rolling window of data The correct way to accomplish what you describe is the new (in 0.7) per-column TTL. Simply set this to 60 * 60 * 24 * 90 (90 days' worth of seconds) and your columns will magically disappear after that length of time. Although that assumes it's okay to lose data or that there is some other method in place to prevent loss of it should the data not be processed to whatever extent is required. TTL:s would be a great way to efficiently achieve the windowing, but it does remove the ability to explicitly control exactly when data is removed (such as after certain batch processing of it has completed). -- / Peter Schuller
RE: rolling window of data
To be a little more clear, a simplified version of what I'm asking is: Let's say you add 1K columns with timestamps 1 to 1000. Then, at an arbitrarily distant point in the future, if you call remove on that CF with timestamp 500 (so the timestamps are logically out of order), will it delete exactly half of it or is there stuff that might go on under the covers that makes this not work as you might expect? -Jeffrey -Original Message- From: Jeffrey Wang [mailto:jw...@palantir.com] Sent: Thursday, February 03, 2011 3:03 PM To: user@cassandra.apache.org Subject: RE: rolling window of data Thanks for the response, but unfortunately a TTL is not enough for us. We would like to be able to dynamically control the window in case there is an unusually large amount of data or something so we don't run out of disk space. One question I have in particular is: if I use the timestamp of my log entries (not necessarily correlated at all with the timestamp of insert) as the timestamp on my mutations will Cassandra do the right thing when I delete? We don't have any need for conflict resolution, so we are currently just using the current time. It seems like there is a possibility, depending on the implementation details of Cassandra, that I could call a remove with a timestamp for which everything before that should get deleted. Like I said before, this seems a bit hacky to me, but would it get the job done? -Jeffrey -Original Message- From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller Sent: Thursday, February 03, 2011 8:48 AM To: user@cassandra.apache.org Subject: Re: rolling window of data The correct way to accomplish what you describe is the new (in 0.7) per-column TTL. Simply set this to 60 * 60 * 24 * 90 (90 day's worth of seconds) and your columns will magically disappear after that length of time. 
Although that assumes it's okay to lose data or that there is some other method in place to prevent loss of it should the data not be processed to whatever extent is required. TTL:s would be a great way to efficiently achieve the windowing, but it does remove the ability to explicitly control exactly when data is removed (such as after certain batch processing of it has completed). -- / Peter Schuller
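A toy model of the question above (not Cassandra internals): treating column names as log timestamps and using the same value as the mutation timestamp, a row-level remove at timestamp 500 should drop exactly the columns with timestamps 1-500, regardless of insertion order. Real Cassandra deletions additionally go through tombstones and gc_grace_seconds before the space is reclaimed.

```java
import java.util.TreeMap;

public class TimestampWindow {
    private final TreeMap<Long, String> columns = new TreeMap<>(); // log timestamp -> entry
    private long deletedAt = Long.MIN_VALUE; // row-level deletion timestamp

    /** Insert using the log-entry time as the mutation timestamp. */
    public void insert(long ts, String value) {
        if (ts > deletedAt) columns.put(ts, value); // older timestamps stay shadowed by the delete
    }

    /** Remove with an artificial timestamp: shadows everything at or before ts. */
    public void deleteRow(long ts) {
        deletedAt = Math.max(deletedAt, ts);
        columns.headMap(ts, true).clear(); // drop every column with timestamp <= ts
    }

    public int liveColumns() { return columns.size(); }
}
```

This is the behavior Cassandra's timestamp comparison gives you, so the approach works, with the caveat Peter raises: once log-entry time is the mutation timestamp, you give up last-write-wins conflict resolution based on wall-clock insert time.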
rolling window of data
Hi, We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old. Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks. -Jeffrey
RE: rolling window of data
Thanks for the link, but unfortunately it doesn't look like it uses a rolling window. As far as I can tell, log entries just keep getting inserted into Cassandra. -Jeffrey From: Aaron Morton [mailto:aa...@thelastpickle.com] Sent: Wednesday, February 02, 2011 9:21 PM To: user@cassandra.apache.org Subject: Re: rolling window of data This project may provide some inspiration for you https://github.com/thobbs/logsandra Not sure if it has a rolling window, if you find out let me know :) Aaron On 03 Feb, 2011,at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote: Hi, We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old. Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks. -Jeffrey