Re: Limit what nodes are writeable
Cassandra has an authentication interface, but doesn't have authorization. So you need to implement authorization in your application layer. maki 2011/7/11 David McNelis dmcne...@agentisenergy.com: I've been looking in the documentation and haven't found anything about this... but is there support for making a node read-only? For example, you have a cluster set up in two different data centers / racks / whatever, with your replication strategy set up so that the data is redundant between the two places. In one of the places all of the incoming data will be processed and inserted into your cluster. In the other data center you plan to allow people to run analytics, but you want to restrict the permissions so that the people running analytics can connect to Cassandra in whatever way makes the most sense for them, but you don't want those people to be able to edit/update data. Is it currently possible to configure your cluster in this manner? Or would it only be possible through a third-party solution like wrapping one of the access libraries in a way that does not support write operations? -- David McNelis Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource. -- w3m
Re: Storing counters in the standard column families along with non-counter columns ?
On 07/10/2011 01:09 PM, Aditya Narayan wrote: Is there any target version in the near future for which this has been promised? The ticket is problematic in that it would -- unless someone has a clever new idea -- require breaking thrift compatibility to add it to the API. Which is unfortunate, since it would be so useful. If it's in the 0.8.x series it will only be through CQL.
Re: Limit what nodes are writeable
I never used the feature, but there is a way to control access based on user name. Configure both conf/passwd.properties and conf/access.properties, then modify cassandra.yaml as follows. # authentication backend, implementing IAuthenticator; used to identify users authenticator: org.apache.cassandra.auth.SimpleAuthenticator # authorization backend, implementing IAuthority; used to limit access/provide permissions authority: org.apache.cassandra.auth.SimpleAuthority 2011/7/11 Maki Watanabe watanabe.m...@gmail.com: Cassandra has an authentication interface, but doesn't have authorization. So you need to implement authorization in your application layer. maki 2011/7/11 David McNelis dmcne...@agentisenergy.com: I've been looking in the documentation and haven't found anything about this... but is there support for making a node read-only? For example, you have a cluster set up in two different data centers / racks / whatever, with your replication strategy set up so that the data is redundant between the two places. In one of the places all of the incoming data will be processed and inserted into your cluster. In the other data center you plan to allow people to run analytics, but you want to restrict the permissions so that the people running analytics can connect to Cassandra in whatever way makes the most sense for them, but you don't want those people to be able to edit/update data. Is it currently possible to configure your cluster in this manner? Or would it only be possible through a third-party solution like wrapping one of the access libraries in a way that does not support write operations? -- David McNelis Lead Software Engineer Agentis Energy www.agentisenergy.com o: 630.359.6395 c: 219.384.5143 A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource. -- w3m -- Yuki Morishita t:yukim (http://twitter.com/yukim)
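For illustration, a rough sketch of what the two property files can look like for SimpleAuthenticator/SimpleAuthority. The exact syntax varies by version, so treat this as an assumption and check the sample passwd.properties and access.properties shipped with your Cassandra distribution; the user names and keyspace name below are made up:

    # conf/passwd.properties -- one user=password per line
    analytics_user=readonlypass
    ingest_user=writepass

    # conf/access.properties -- per-keyspace permissions
    # read-only access for the analytics user, read/write for the ingest user
    MyKeyspace.<ro>=analytics_user
    MyKeyspace.<rw>=ingest_user

Note this restricts access per keyspace and user rather than per node; clients in the analytics data center would simply connect with the read-only credentials.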
AntiEntropy?
I looked around in the code, and it seems that AntiEntropy operations are not run automatically by the server daemon, but only invoked manually through nodetool. Am I correct? If this is the case, I guess the reason it isn't run automatically is just the load impact it brings to servers? Thanks Yang
Re: AntiEntropy?
I looked around in the code, it seems that AntiEntropy operations are not automatically run in the server daemon, but only manually invoked through nodetool, am I correct? Yes, and it's important that you do run repair: http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair -- / Peter Schuller
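For illustration, scheduling repair is usually just a cron job invoking nodetool on each node; a minimal sketch (the install path and schedule here are made-up assumptions):

    # run repair every Sunday at 02:00, comfortably within the default
    # GCGraceSeconds of 10 days (864000 seconds)
    0 2 * * 0  /opt/cassandra/bin/nodetool -h localhost repair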
Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'
Instead of doing nodetool repair, is it not a cheaper operation to keep track of failed writes (be they deletes, inserts or updates) and read these failed writes at a set frequency in some batch job? By reading them, RR would get triggered and they would get to a consistent state. Because these would be targeted reads (only for those that failed during writes), it should be a shorter list and quicker to repair than nodetool repair. On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Read repair does NOT repair tombstones. It does, but you can't rely on RR to repair _all_ tombstones, because RR only happens if the row in question is requested by a client. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Storing counters in the standard column families along with non-counter columns ?
Oops, that's really disheartening, and it could seriously impact our plans for going live in the near future. Without this facility, I guess counters currently have very little usefulness. On Mon, Jul 11, 2011 at 8:16 PM, Chris Burroughs chris.burrou...@gmail.com wrote: On 07/10/2011 01:09 PM, Aditya Narayan wrote: Is there any target version in the near future for which this has been promised? The ticket is problematic in that it would -- unless someone has a clever new idea -- require breaking thrift compatibility to add it to the API. Which is unfortunate, since it would be so useful. If it's in the 0.8.x series it will only be through CQL.
Secondary Index doesn't work with LOCAL_QUORUM
Hi, We're using Cassandra with 2 DCs - one is an OLTP Cassandra DC, 6 nodes, with RF 3 - the other is a Brisk DC, 3 nodes, with RF 1. We noticed that when I do a write-then-read operation on the Cassandra DC, it fails with the following information (from cqlsh): Unable to complete request: one or more nodes were unavailable. - write: LOCAL_QUORUM, successful - read: LOCAL_QUORUM, using the secondary indexed column, fails It seems to take a long while before this starts working. When I retry the same query after ~10 minutes, it actually succeeds. Any help is appreciated. Thanks, Hefeng
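The report above used cqlsh; as an illustration of the same write-then-read pattern from a client library, here is a rough pycassa sketch with both operations at LOCAL_QUORUM (the keyspace, column family, column and key names are made-up assumptions, and the 'state' column is assumed to already have a KEYS index):

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel
    from pycassa.index import create_index_expression, create_index_clause

    pool = pycassa.ConnectionPool('MyKeyspace', ['node1:9160'])
    users = pycassa.ColumnFamily(pool, 'Users')

    # write at LOCAL_QUORUM
    users.insert('jsmith', {'state': 'NY'},
                 write_consistency_level=ConsistencyLevel.LOCAL_QUORUM)

    # read back through the secondary index, also at LOCAL_QUORUM
    clause = create_index_clause([create_index_expression('state', 'NY')], count=10)
    for key, cols in users.get_indexed_slices(
            clause, read_consistency_level=ConsistencyLevel.LOCAL_QUORUM):
        print key, cols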
Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'
Never mind. I see the issue with this. I will be able to catch the writes as failed only if I set CL=ALL. For other CLs, I may not know that it failed on some node. On Mon, Jul 11, 2011 at 2:33 PM, A J s5a...@gmail.com wrote: Instead of doing nodetool repair, is it not a cheaper operation to keep tab of failed writes (be it deletes or inserts or updates) and read these failed writes at a set frequency in some batch job ? By reading them, RR would get triggered and they would get to a consistent state. Because these would targeted reads (only for those that failed during writes), it should be a shorter list and quick to repair (than nodetool repair). On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Read repair does NOT repair tombstones. It does, but you can't rely on RR to repair _all_ tombstones, because RR only happens if the row in question is requested by a client. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
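For illustration only, catching a failed write at CL=ALL in a client would look roughly like this pycassa sketch (the keyspace, column family and key names are made up, and the exact exception types raised depend on the client and version, so a broad except is used here):

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    pool = pycassa.ConnectionPool('MyKeyspace')
    cf = pycassa.ColumnFamily(pool, 'Standard1')

    failed_keys = []
    try:
        cf.insert('some_key', {'col': 'val'},
                  write_consistency_level=ConsistencyLevel.ALL)
    except Exception:
        # at CL=ALL any unavailable or unresponsive replica surfaces as an error,
        # so the key can be recorded for a later "repair read"
        failed_keys.append('some_key')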
custom StoragePort?
I tried to run multiple cassandra daemons on the same host, using different ports, for a test env. I thought this would work, but it turns out that the StoragePort used by OutboundTcpConnection is always assumed to be the one specified in .yaml, i.e. the code assumes that the storage port is the same everywhere. In fact this assumption seems deeply held in many places in the code, so it's a bit difficult to refactor, for example by substituting InetAddress with InetSocketAddress. I am just wondering, do you see any other value in a custom storage port, besides testing? If there is real value, maybe someone more familiar with the code could do the refactoring. Thanks yang
Node repair questions
Hello, Have the following questions related to nodetool repair: 1. I know that the Nodetool Repair Interval has to be less than GCGraceSeconds. How do I come up with exact values for GCGraceSeconds and the 'Nodetool Repair Interval'? What factors would make me want to change the default GCGraceSeconds of 10 days? Similarly, what factors would make me want to keep the Nodetool Repair Interval just slightly less than GCGraceSeconds (say, a day less)? 2. Does a Nodetool Repair block any reads and writes on the node while the repair is going on? During repair, if I try to do an insert, will the insert wait for the repair to complete first? 3. I read that repair can impact your workload as it causes additional disk and CPU activity. But are there any details of the impact mechanism, and any ballpark on how much read/write performance deteriorates? Thanks.
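For reference, GCGraceSeconds is a per-column-family setting; with cassandra-cli it is changed roughly like this (the attribute name is gc_grace in the 0.7/0.8 CLI as far as I recall - treat it as an assumption and check 'help update column family'; the CF name is made up):

    update column family Standard1 with gc_grace = 864000;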
Re: custom StoragePort?
never mind, found this.. https://issues.apache.org/jira/browse/CASSANDRA-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel On Mon, Jul 11, 2011 at 12:39 PM, Yang tedd...@gmail.com wrote: I tried to run multiple cassandra daemons on the same host, using different ports, for a test env. I thought this would work, but it turns out that the StoragePort used by outputTcpConnection is always assumed to be the one specified in .yaml, i.e. the code assumes that the storageport is same everywhere. in fact this assumption seems deeply held in many places in the code, so it's a bit difficult to refactor it , for example by substituting InetAddress with InetSocketAddress. I am just wondering, do you see any other value to a custom storageport, besides testing? if there is real value, maybe someone more familiar with the code could do the refactoring Thanks yang
Out of memory error in cassandra
Hi All, I am getting following error from cassandra: ERROR [ReadStage:23] 2011-07-10 17:19:18,300 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:49) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:117) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:94) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:107) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:72) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1311) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1203) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1131) at org.apache.cassandra.db.Table.getRow(Table.java:333) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:69) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) INFO [ScheduledTasks:1] 2011-07-10 17:19:18,306 StatusLogger.java (line 66) RequestResponseStage 0 0 ERROR [ReadStage:23] 2011-07-10 17:19:18,306 AbstractCassandraDaemon.java (line 114) Fatal exception in thread Thread[ReadStage:23,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:49) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:117) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:94) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:107) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:72) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1311) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1203) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1131) Can someone please help debug this? The maximum heap size is 28G . I am not sure why cassandra is giving Out of memory error here. Thanks Anurag
Re: Out of memory error in cassandra
Are you on a 64 bit VM? A 32 bit vm will basically ignore any setting over 2GB On Mon, Jul 11, 2011 at 4:55 PM, Anurag Gujral anurag.guj...@gmail.comwrote: Hi All, I am getting following error from cassandra: ERROR [ReadStage:23] 2011-07-10 17:19:18,300 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:49) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:117) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:94) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:107) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:72) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1311) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1203) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1131) at org.apache.cassandra.db.Table.getRow(Table.java:333) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:69) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) INFO [ScheduledTasks:1] 2011-07-10 17:19:18,306 StatusLogger.java (line 66) RequestResponseStage 0 0 ERROR [ReadStage:23] 2011-07-10 17:19:18,306 AbstractCassandraDaemon.java (line 114) Fatal exception in thread Thread[ReadStage:23,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:49) at org.apache.cassandra.utils.BloomFilterSerializer.deserialize(BloomFilterSerializer.java:30) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:117) at org.apache.cassandra.io.sstable.IndexHelper.defreezeBloomFilter(IndexHelper.java:94) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:107) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:72) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1311) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1203) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1131) Can someone please help debug this? The maximum heap size is 28G . I am not sure why cassandra is giving Out of memory error here. Thanks Anurag -- It's always darkest just before you are eaten by a grue.
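As a quick check (assuming a typical package layout), the JVM bitness and the heap actually given to Cassandra can be verified like this:

    # should report a "64-Bit Server VM" if the heap is to exceed a few GB
    java -version
    # heap settings normally live in conf/cassandra-env.sh, e.g.
    MAX_HEAP_SIZE="28G"
    HEAP_NEWSIZE="800M"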
RE: custom StoragePort?
If you are on linux see: https://github.com/pcmanus/ccm -Original Message- From: Yang [mailto:tedd...@gmail.com] Sent: Monday, July 11, 2011 3:08 PM To: user@cassandra.apache.org Subject: Re: custom StoragePort? never mind, found this.. https://issues.apache.org/jira/browse/CASSANDRA-200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel On Mon, Jul 11, 2011 at 12:39 PM, Yang tedd...@gmail.com wrote: I tried to run multiple cassandra daemons on the same host, using different ports, for a test env. I thought this would work, but it turns out that the StoragePort used by outputTcpConnection is always assumed to be the one specified in .yaml, i.e. the code assumes that the storageport is same everywhere. in fact this assumption seems deeply held in many places in the code, so it's a bit difficult to refactor it , for example by substituting InetAddress with InetSocketAddress. I am just wondering, do you see any other value to a custom storageport, besides testing? if there is real value, maybe someone more familiar with the code could do the refactoring Thanks yang
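For illustration, ccm runs several local nodes on loopback aliases, so the storage port never has to change. Usage is roughly the following, but the flags have changed over time, so check the project README for your version before relying on them:

    ccm create test -v 0.8.1 -n 3 -s   # create and start a 3-node local cluster
    ccm node1 ring                     # run nodetool ring against node1
    ccm remove                         # tear the cluster down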
RE: Node repair questions
The more often you repair, the quicker it will be. The more often your nodes go down the longer it will be. Repair streams data that is missing between nodes. So the more data that is different the longer it will take. Your workload is impacted because the node has to scan the data it has to be able to compare with other nodes, and if there are differences, it has to send/receive data from other nodes. -Original Message- From: A J [mailto:s5a...@gmail.com] Sent: Monday, July 11, 2011 2:43 PM To: user@cassandra.apache.org Subject: Node repair questions Hello, Have the following questions related to nodetool repair: 1. I know that Nodetool Repair Interval has to be less than GCGraceSeconds. How do I come up with an exact value of GCGraceSeconds and 'Nodetool Repair Interval'. What factors would want me to change the default of 10 days of GCGraceSeconds. Similarly what factors would want me to keep Nodetool Repair Interval to be just slightly less than GCGraceSeconds (say a day less). 2. Does a Nodetool Repair block any reads and writes on the node, while the repair is going on ? During repair, if I try to do an insert, will the insert wait for repair to complete first ? 3. I read that repair can impact your workload as it causes additional disk and cpu activity. But any details of the impact mechanism and any ballpark on how much the read/write performance deteriorates ? Thanks.
Re: Cassandra Secondary index/Twissandra
Hi Aaron, Thank you again for your response. I've read the article but I didn't understand everything. It would be great if the benchmark included the actual CLI/Python commands (that way it will be easier to understand the query). In addition, an explanation about row pages - what are they? Anyway, for a sense of scale, we can take as an example the average Facebook/Twitter user, which can mean 100K columns per user (Userline). So what is needed is to take the first 50 columns (ordered by TimeUUID), then columns 51 to 100, 101 to 150, etc. Any suggestion on how fast it will be? Or how you would recommend configuring Cassandra? Or even a different way of achieving that goal? Thanks, Eldad. On Sun, Jul 10, 2011 at 8:31 PM, aaron morton aa...@thelastpickle.com wrote: Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF? A select with no start or finish column name, a column count and not in reversed order is about the fastest read query. You will need to do a reversed query, which will be a little slower. But may still be plenty fast enough, depending on scale and throughput and all those other things. see http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 10 Jul 2011, at 00:14, Eldad Yamin wrote: Aaron - Thank you for the fast response! 1. Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Depends on what sort of operations you are doing. Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. In Twissandra, the columns are used as alternative index for the Userline/Timeline. therefore the operation I'm going to do is slice_range. I'm going to get (for example) the first 50 columns (using comparator of TimeUUID/LONG). Can you recommend on a better way of doing that or a way to tune Cassandra to support those 2 CF? Thanks! On Sun, Jul 10, 2011 at 3:26 AM, aaron morton aa...@thelastpickle.com wrote: 1. Is there a limit on the number of columns in a single column family that serve as secondary indexes? AFAIK there is no coded limit, however every index is implemented as another (hidden) Column Family that inherits the settings of the parent CF. So under 0.7 you may run out of memory, under 0.8 you may flush a lot. Also, when an indexed column is updated there are potentially 3 operations that have to happen: read the old value, delete the old value, write the new value. More indexes == more index updating, just like any other database. 1. Does performance decrease (significantly) if the uniqueness of the column’s values is high? Low cardinality is recommended http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Secondary-indices-Why-low-cardinality-td6160509.html 1. The CF for Userline/Timeline - have comparator of LONG_TYPE and not TimeUUID? Probably just to make the demo easier. It's used to order tweets in the user and public timelines by the current time https://github.com/twissandra/twissandra/blob/master/cass.py#L204 1. Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Depends on what sort of operations you are doing.
Some read operations have to pay a constant cost to decode the row level column index, this can be tuned though. AFAIK the comparator type has very little to do with the performance. Hope that helps. -- Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 9 Jul 2011, at 12:15, Eldad Yamin wrote: Hi, I have a few questions: *Secondary index* 1. Is there a limit on the number of columns in a single column family that serve as secondary indexes? 2. Does performance decrease (significantly) if the uniqueness of the column’s values is high? *Twissandra* 1. Why in the source (or any tutorial I've read): The CF for Userline/Timeline - have comparator of LONG_TYPE and not TimeUUID? https://github.com/twissandra/twissandra/blob/master/tweets/management/commands/sync_cassandra.py 2. Does performance decrease (significantly) if the uniqueness of the column’s name is high when comparator is LONG_TYPE/TimeUUID and each row has lots of columns? Thanks! Eldad
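To make the paging pattern concrete, here is a rough pycassa sketch of reading a userline 50 columns at a time in reverse time order. The keyspace, CF and key names are made-up assumptions; with a LongType or TimeUUID comparator the columns come back ordered, so each page simply starts from the last column of the previous one:

    import pycassa

    pool = pycassa.ConnectionPool('Twissandra')
    userline = pycassa.ColumnFamily(pool, 'Userline')

    # newest 50 entries for this user
    page = userline.get('eldad', column_count=50, column_reversed=True)
    last_seen = list(page.keys())[-1]

    # next page: ask for 51 starting at the last column seen, then drop the overlap
    next_page = userline.get('eldad', column_start=last_seen,
                             column_count=51, column_reversed=True)
    next_page.pop(last_seen)

Each page is a single short slice, so the per-page cost stays roughly constant regardless of how deep into a 100K-column row you go, apart from the row-level column index decode mentioned above.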
Re: Strong Consistency with ONE read/writes
I'm not proposing any changes to be done, but this looks like a very interesting topic for thought/hack/learning, so the following are only for thought exercises HBase enforces a single write/read entry point, so you can achieve strong consistency by writing/reading only one node. but just writing to one node exposes you to loss of data if that node fails. so the region server HLog is replicated to 3 HDFS data nodes. the interesting thing here is that each replica sees a complete *prefix* of the HLog: it won't miss a record, if a record sync() to a data node fails, all the existing bytes in the block are replicated to a new data node. if we employ a similar leader node among the N replicas of cassandra (coordinator always waits for the reply from leader, but leader does not do further replication like in HBase or counters), the leader sees all writes onto the key range, but the other replicas could miss some writes, as a result, each of the non-leader replicas' write history has some holes, so when the leader dies, and when we elect a new one, no one is going to have a complete history. so you'd have to do a repair amongst all the replicas to reconstruct the full history, which is slow. it seems possible that we could utilize the FIFO property of the InComingTCPConnection to simplify history reconstruction, just like Zookeeper. if the IncomingTcpConnection of a replica fails, that means that it may have missed some edits, then when it reconnects, we force it to talk to the active leader first, to catch up to date. when the leader dies, the next leader is elected to be the replica with the most recent history. by maintaining the property that each node has a complete prefix of history, we only need to catch up on the tail of history, and avoid doing a complete repair on the entire memtable+SStable. but one issue is that the history at the leader has to be kept really long - if a non-leader replica goes off for 2 days, the leader has to keep all the history for 2 days to feed them to the replica when it comes back online. but possibly this could be limited to some max length so that over that length, the woken replica simply does a complete bootstrap. thanks yang On Sun, Jul 3, 2011 at 8:25 PM, AJ a...@dude.podzone.net wrote: We seem to be having a fundamental misunderstanding. Thanks for your comments. aj On 7/3/2011 8:28 PM, William Oberman wrote: I'm using cassandra as a tool, like a black box with a certain contract to the world. Without modifying the core, C* will send the updates to all replicas, so your plan would cause the extra write (for the placeholder). I wasn't assuming a modification to how C* fundamentally works. Sounds like you are hacking (or at least looking) at the source, so all the power to you if/when you try these kind of changes. will On Sun, Jul 3, 2011 at 8:45 PM, AJ a...@dude.podzone.net wrote: On 7/3/2011 6:32 PM, William Oberman wrote: Was just going off of: Send the value to the primary replica and send placeholder values to the other replicas. Sounded like you wanted to write the value to one, and write the placeholder to N-1 to me. Yes, that is what I was suggesting. The point of the placeholders is to handle the crash case that I talked about... like a WAL does. But, C* will propagate the value to N-1 eventually anyways, 'cause that's just what it does anyways :-) will On Sun, Jul 3, 2011 at 7:47 PM, AJ a...@dude.podzone.net wrote: On 7/3/2011 3:49 PM, Will Oberman wrote: Why not send the value itself instead of a placeholder? 
Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1). Where N is replication factor. Seems like extra network and IO instead of less... To send the value to each node is 1.) unnecessary, 2.) will only cause a large burst of network traffic. Think about if it's a large data value, such as a document. Just let C* do it's thing. The extra messages are tiny and doesn't significantly increase latency since they are all sent asynchronously. Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a cassandra dev) I don't see how. Maybe you should take a peek at the source. On Jul 3, 2011, at 5:20 PM, AJ a...@dude.podzone.net wrote: Yang, How would you deal with the problem when the 1st node responds success but then crashes before completely forwarding any replicas? Then, after switching to the next primary, a read would return stale data. Here's a quick-n-dirty way: Send the value to the primary replica and send placeholder values to the other replicas. The placeholder value is something like, PENDING_UPDATE. The placeholder values are sent with timestamps 1 less than the timestamp for the actual value that went to the primary. Later, when the changes propagate, the actual values
Re: Node repair questions
(not answering (1) right now, because it's more involved) 2. Does a Nodetool Repair block any reads and writes on the node, while the repair is going on ? During repair, if I try to do an insert, will the insert wait for repair to complete first ? It doesn't imply any blocking. It's roughly similar to compaction in its impact on nodes; in addition when data is streamed (if any) the impact should be similar to node bootstrapping. 3. I read that repair can impact your workload as it causes additional disk and cpu activity. But any details of the impact mechanism and any ballpark on how much the read/write performance deteriorates ? The compaction part will have an impact similar to regular compaction except it's read-only (no writing of new sstables). It is subject to compaction throttling if you run a version of Cassandra with compaction throttling. Streaming causes disk/networking load and is not yet rate limited like compaction. In addition be aware that repair can cause disk space usage to temporarily increase if there are significant differences to be repaired. -- / Peter Schuller
Re: Node repair questions
The more often you repair, the quicker it will be. The more often your nodes go down the longer it will be. Going to have to disagree a bit here. In most cases the cost of running through the data and calculating the Merkle tree should be quite significant, and hopefully the differences should be fairly limited. The actual data being streamed can be a problem, but unless you have a situation where you are consistently going significantly out of sync and there is no read-repair, I wouldn't recommend more frequent repairs if your aim is to minimize the impact on the cluster. (In the general case, there will be exceptions.) Also to OP: In general, expect repairs to be more impactful on your cluster the bigger your data is in comparison to available memory used for caching. Basically the more cache-reliant you are, the greater the impact of repairs (and compaction) will tend to be. -- / Peter Schuller
Re: Corrupted data
That looks a lot like what I've seen from machines with bad RAM. 2011/7/8 Héctor Izquierdo Seliva izquie...@strands.com: Hi everyone, I'm having thousands of these errors: WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 737) Non-fatal error reading row (stacktrace follows) java.io.IOError: java.io.IOException: Impossible row size 6292724931198053 at org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:719) at org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:633) at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:65) at org.apache.cassandra.db.compaction.CompactionManager$3.call(CompactionManager.java:250) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Impossible row size 6292724931198053 ... 9 more INFO [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 743) Retrying from row index; data is -8 bytes starting at 4735525245 WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705 CompactionManager.java (line 767) Retry failed too. Skipping to next row (retry's stacktrace follows) java.io.IOError: java.io.EOFException: bloom filter claims to be 863794556 bytes, longer than entire row size -8 This is during scrub; I saw similar errors while in normal operation as well. Is there anything I can do? It looks like I'm going to lose a ton of data. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
thrift API
Hi, can anyone explain why APIs such as multiget, batch_insert, and get_range_slice were removed in versions above 0.7?
commitlog replay missing data
Hey all, Recently upgraded to 0.8.1 and noticed what seems to be missing data after a commitlog replay on a single-node cluster. I start the node, insert a bunch of stuff (~600MB), stop it, and restart it. There are log messages pertaining to the commitlog replay and no errors, but some of the data is missing. If I flush before stopping the node, everything is fine, and running cfstats in the two cases shows different amounts of data in the SSTables. Moreover, the amount of data that is missing is nondeterministic. Has anyone run into this? Thanks. Here is the output of a side-by-side diff between cfstats outputs for a single CF before restarting (left) and after (right). Somehow a 37MB memtable became a 2.9MB SSTable (note the difference in write count as well)?
Column Family: Blocks (before restart | after restart)
SSTable count: 0 | 1
Space used (live): 0 | 2907637
Space used (total): 0 | 2907637
Memtable Columns Count: 8198 | 0
Memtable Data Size: 37550510 | 0
Memtable Switch Count: 0 | 1
Read Count: 0 | 0
Read Latency: NaN ms. | NaN ms.
Write Count: 8198 | 1526
Write Latency: 0.018 ms. | 0.011 ms.
Pending Tasks: 0 | 0
Key cache capacity: 20 | 20
Key cache size: 0 | 0
Key cache hit rate: NaN | NaN
Row cache: disabled | disabled
Compacted row minimum size: 0 | 1110
Compacted row maximum size: 0 | 2299
Compacted row mean size: 0 | 1960
Note that I patched https://issues.apache.org/jira/browse/CASSANDRA-2317 in my version, but there are no deletions involved so I don't think it's relevant unless I messed something up while patching. -Jeffrey
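As a workaround while investigating, flushing (or draining) the node before stopping it forces memtables to SSTables so nothing depends on commitlog replay:

    nodetool -h localhost flush    # flush all memtables to disk
    nodetool -h localhost drain    # flush and stop accepting writes before shutdown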
Re: CassandraFS in 1.0?
It's not, currently, but I'm happy to answer questions about its architecture. On Thu, Jul 7, 2011 at 10:35, Norman Maurer norman.mau...@googlemail.com wrote: May I ask if it's open source, by any chance? bye norman On Thursday, July 7, 2011, David Strauss da...@davidstrauss.net wrote: I'm not sure HDFS has the right properties for a media-storage file system. We have, however, built a WebDAV server on top of Cassandra that avoids any pretension of being a general-purpose, POSIX-compliant file system. We mount it on our servers using davfs2, which is also nice for a few reasons: * We can use standard HTTP load-balancing and dead host avoidance strategies with WebDAV. * Encrypting access and authenticating clients with PKI/HTTPS works seamlessly. * WebDAV + davfs2 is etag-header aware, allowing clients to efficiently validate cached items. * HTTP is browser and CDN/reverse proxy cache friendly for distributing content to people who don't need to mount the file system. * We could extend the server's support to allow connections from a broad variety of interactive desktop clients. On Wed, Jul 6, 2011 at 13:11, Joseph Stein crypt...@gmail.com wrote: Hey folks, I am going to start prototyping our media tier using cassandra as a file system (meaning uploading video/audio/images to a web server, saving them in Cassandra, and then streaming them out). Has anyone done this before? I was thinking Brisk's CassandraFS might be a fantastic implementation for this, but then I feel that I need to run another/different Cassandra cluster outside of what our ops folks do with Apache Cassandra 0.8.X. Am I best to just compress files uploaded to the web server and then start chunking and saving chunks in rows and columns so the mem issue does not smack me in the face? And use our existing cluster and build it out accordingly? I am sure our ops people would like the command line aspect of CassandraFS but I am looking for something that makes the most sense all around. It seems to me there is a REALLY great thing in CassandraFS and I would love to see it as part of 1.0 =8^) or at a minimum some streamlined implementation to do the same thing. Comparing to HDFS, which is part of the Hadoop project even though Cloudera has a distribution of Hadoop :) maybe that can work here too _fingers_crossed_ (or mongodb-gridfs) happy to help as I am moving down this road in general Thanks! /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */ -- David Strauss | da...@davidstrauss.net | +1 512 577 5827 [mobile] -- David Strauss | da...@davidstrauss.net | +1 512 577 5827 [mobile]
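On the chunking question, here is a minimal pycassa sketch of splitting an uploaded file into fixed-size column chunks within a single row. The keyspace and CF names, the 1 MB chunk size, and the LongType comparator are all made-up assumptions for illustration:

    import uuid
    import pycassa

    CHUNK_SIZE = 1024 * 1024  # 1 MB per column keeps any single read/write small

    pool = pycassa.ConnectionPool('Media')
    chunks = pycassa.ColumnFamily(pool, 'FileChunks')  # comparator assumed LongType

    def store_file(path):
        """Store a file as one row whose columns are sequentially numbered chunks."""
        file_id = str(uuid.uuid4())
        with open(path, 'rb') as f:
            index = 0
            while True:
                chunk = f.read(CHUNK_SIZE)
                if not chunk:
                    break
                chunks.insert(file_id, {index: chunk})
                index += 1
        return file_id

    def read_file(file_id):
        """Reassemble the chunks in comparator (i.e. numeric) order."""
        return ''.join(chunks.get(file_id, column_count=10000).values())

Keeping chunks to roughly a megabyte avoids holding whole videos in memory on either the client or the server, which is the "mem issue" mentioned above.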