Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns
Thanks Tyler. I was unaware of counters. The use case for column counts is really from an operational perspective, to allow a sysadmin to do ad hoc checks on columns to see if something has gone wrong in software outside of Cassandra. I think a cassandra-cli command such as count that makes Cassandra fall over is not ideal, unless we can say that for X columns Cassandra needs at least Y memory to remain stable. Cheers Dave On Sun, Dec 12, 2010 at 6:39 PM, Tyler Hobbs ty...@riptano.com wrote: Cassandra has to deserialize all of the columns in the row for get_count(), so from Cassandra's perspective it's almost as much work as getting the entire row; it just doesn't have to send everything back over the network. If you're frequently counting 8 million columns (or really, anything significant), you need to use counters instead. If this is a rare occurrence, you can do the count in multiple chunks by using a starting and ending column in the SlicePredicate for each chunk, but this requires some rough knowledge about the distribution of the column names in the row. - Tyler
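For readers of the archive, here is a rough, untested sketch of the chunked counting Tyler describes, written in Scala against the stock Thrift-generated Java client for 0.6; the object name and chunk size are illustrative, not from the thread. As far as I can tell, 0.6's get_count() takes no predicate, so the chunking is done with get_slice and the columns are counted client-side:

import org.apache.cassandra.thrift.{Cassandra, ColumnParent, ConsistencyLevel, SlicePredicate, SliceRange}
import scala.collection.JavaConverters._

object ChunkedCount {
  // Count all columns of one row in pages of `chunk`, so no single call
  // forces Cassandra to materialize the whole multi-million-column row.
  def countRow(client: Cassandra.Client, keyspace: String,
               key: String, cf: String, chunk: Int = 10000): Long = {
    val parent = new ColumnParent(cf)
    var start = Array.empty[Byte]  // empty start = begin at the first column
    var total = 0L
    var more = true
    while (more) {
      val pred = new SlicePredicate().setSlice_range(
        new SliceRange(start, Array.empty[Byte], false, chunk))
      val cols = client.get_slice(keyspace, key, parent, pred,
        ConsistencyLevel.ONE).asScala
      total += cols.size
      if (cols.size < chunk) more = false
      else {
        // Resume from the last column seen; slice starts are inclusive, so
        // the next page re-returns it - pre-subtract to avoid double-counting.
        start = cols.last.getColumn.getName
        total -= 1
      }
    }
    total
  }
}

This keeps the memory needed per call bounded by the chunk size rather than the row size, at the cost of one round trip per chunk.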
Re: N to N relationships
You want to store every value twice? That would be a pain to maintain, and could possibly lead to inconsistent data. On Fri, Dec 10, 2010 at 3:50 AM, Nick Bailey n...@riptano.com wrote: I would also recommend two column families. Storing the key as NxN would require you to hit multiple machines to query for an entire row or column with RandomPartitioner. Even with OPP you would need to pick rows or columns to order by, and the other would require hitting multiple machines. Two column families avoids this and avoids any problems with choosing OPP. On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.com wrote: I'm assuming you have one matrix and you know the dimensions. Also, as you say, the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the Columns and one for the Rows. The key for each would be the col / row number, each Cassandra column name would be the id of the other dimension, and the value whatever you want. - when storing the data, update both the Column and Row CF (see the sketch below) - reading a whole row/col is then simply a read from the appropriate CF - reading an intersection is a get_slice to either the col or row CF using the column_names field to identify the other dimension. You would not need secondary indexes to serve these queries. Hope that helps. Aaron On 10 Dec, 2010, at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote: I mean if I have secondary indexes. Apparently they are calculated in the background... On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote: What do you mean by indexing? On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com wrote: Thanks a lot for the answer. What about the indexing when adding a new element? Is it incremental? Thanks again On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n? Then getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com wrote: Hello, For a specific case, we are thinking about representing an N to N relationship with an NxN matrix in Cassandra. The relations will exist only between a subset of elements, so the matrix will mostly contain empty elements. We have a set of questions concerning this: - what is the best way to represent this matrix? What would have the best performance in reading? In writing? . a super column family with n column families, with n columns each . a column family with n columns and n lines In the second case, we would need to extract 2 kinds of information: - all the relations for a line: this should be no specific problem; - all the relations for a column: in that case we would need an index for the columns, right? And then get all the lines where the value of the column in question is not null... Is that the correct way to do it? When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon
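To make the dual-CF layout concrete, a minimal untested sketch against the 0.6 Thrift client from Scala; the CF names Rows/Columns and the key scheme are hypothetical illustrations, not from the thread:

import org.apache.cassandra.thrift.{Cassandra, ColumnPath, ConsistencyLevel}

def putCell(client: Cassandra.Client, keyspace: String,
            row: Int, col: Int, value: Array[Byte]): Unit = {
  val ts = System.currentTimeMillis
  // Rows CF: key = row number, column name = column number.
  client.insert(keyspace, row.toString,
    new ColumnPath("Rows").setColumn(col.toString.getBytes),
    value, ts, ConsistencyLevel.QUORUM)
  // Columns CF: key = column number, column name = row number.
  // The same cell is written twice so that either dimension can be read
  // back as one single-key slice.
  client.insert(keyspace, col.toString,
    new ColumnPath("Columns").setColumn(row.toString.getBytes),
    value, ts, ConsistencyLevel.QUORUM)
}

Using the same timestamp for both inserts means concurrent overwrites of a cell resolve the same way in both views, which addresses part of David's inconsistency worry (though a client crash between the two inserts can still leave them out of sync until the next write).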
Quorum and Datacenter loss
Hi Cassandra experts - We're planning a Cassandra cluster across 2 datacenters (datacenter-aware, random partitioning) with QUORUM consistency. It seems to me that with 2 datacenters, if one datacenter is lost, reads/writes will fail in the surviving datacenter because QUORUM requires N/2 + 1 replicas to respond. In other words, you need more than half of the replicas to respond, but after a datacenter loss you would only ever get half of them at best. Is my logic wrong here? Is there a way to ensure the nodes in the surviving datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? Thanks.
Re: Quorum and Datacenter loss
Is my logic wrong here? Is there a way to ensure the nodes in the surviving datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? It's impossible to achieve consistency and availability at the same time. See: http://en.wikipedia.org/wiki/CAP_theorem -- / Peter Schuller
Re: Quorum and Datacenter loss
Is my logic wrong here? Is there a way to ensure the nodes in the surviving datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? It's impossible to achieve consistency and availability at the same time. See: (Assuming partition tolerance.) Anyway, to expand a bit: the final consequence is that if you have a cluster that really does need QUORUM consistency, you won't be able to survive (in terms of availability, i.e., the cluster serving your traffic) a datacenter going down. If you want to continue operating in the case of a partition, you (1) cannot use QUORUM and (2) your application must be designed to work with, and survive seeing, inconsistent data. -- / Peter Schuller
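To put numbers on it: QUORUM needs floor(RF/2) + 1 replicas to respond. With RF = 4 split 2 + 2 across two datacenters, QUORUM needs 3 responses, but losing either datacenter leaves at most 2 live replicas, so every QUORUM read and write fails. The same holds for any even split: floor(RF/2) + 1 is strictly more than RF/2, and RF/2 is all a surviving half can ever offer.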
Unsubscribe
Unsubscribe Please Sent from my iPad On Dec 12, 2010, at 1:26 AM, Dave Martin moyesys...@googlemail.com wrote: Hi there, I see the following: 1) Add 8,000,000 columns to a single row. Each column name is a UUID. 2) Use cassandra-cli to run count keyspace.cf['myGUID'] The following is reported in the logs:

ERROR [DroppedMessagesLogger] 2010-12-12 18:17:36,046 CassandraDaemon.java (line 87) Uncaught exception in thread Thread[DroppedMessagesLogger,5,main]
java.lang.OutOfMemoryError: Java heap space
ERROR [pool-1-thread-2] 2010-12-12 18:17:36,046 Cassandra.java (line 1407) Internal error processing get_count
java.lang.OutOfMemoryError: Java heap space

and Cassandra falls over. I see the same behaviour with 0.6.6. Increasing the memory allocation with the -Xmx/-Xms args to 4GB allows the count to return in this particular example (i.e. no OutOfMemoryError is thrown). Here's the Scala code that was run to load the columns, which uses the AKKA persistence API:

object ColumnTest {
  def main(args: Array[String]): Unit = {
    println("Super column test starting")
    val hosts = Array("localhost")
    val sessions = new CassandraSessionPool("occurrence",
      StackPool(SocketProvider("localhost", 9160)),
      Protocol.Binary, ConsistencyLevel.ONE)
    val session = sessions.newSession
    loadRow("myGUID", 800, session)
    session.close
  }

  def loadRow(key: String, noOfColumns: Int, session: CassandraSession) {
    print("loading: " + key + ", with columns: " + noOfColumns)
    val start = System.currentTimeMillis
    val rawPath = new ColumnPath("dr")
    for (i <- 0 until noOfColumns) {
      val recordUuid = UUID.randomUUID.toString
      session ++| (key, rawPath.setColumn(recordUuid.getBytes), "1".getBytes, System.currentTimeMillis)
      session.flush
    }
    val finish = System.currentTimeMillis
    print(", Time taken (secs): " + ((finish - start) / 1000) + " seconds.\n")
  }
}

Here's the configuration used:

# Arguments to pass to the JVM
JVM_OPTS=" \
  -ea \
  -Xms1G \
  -Xmx2G \
  -XX:+UseParNewGC \
  -XX:+UseConcMarkSweepGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=8 \
  -XX:MaxTenuringThreshold=1 \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+HeapDumpOnOutOfMemoryError \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false"

Admittedly the resource allocation is small, but I wondered if there should be some configuration guidelines (e.g. memory allocation vs number of columns supported). I'm running this on my MBP with a single node, and java as thus:

$ java -version
java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04-307-10M3261)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03-307, mixed mode)

Here's the CF definition:

<Keyspace Name="occurrence">
  <ColumnFamily Name="dr" CompareWith="UTF8Type" Comment="The column family for dataset tracking"/>
  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
  <ReplicationFactor>1</ReplicationFactor>
  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
</Keyspace>

Apologies in advance if this is a known issue or a known limitation of 0.6.x. I had wondered if I was hitting the 2GB row limit of the 0.6.x releases, but 8 million columns is approximately 300MB in this particular case. I guess it may also be a result of the limitations of Thrift (i.e. no streaming capabilities). Any thoughts appreciated, Dave
Re: Unsubscribe
Unsubscribe http://wiki.apache.org/cassandra/FAQ#unsubscribe -- / Peter Schuller
Re: Quorum and Datacenter loss
Thanks a lot Peter. So basically we would need to choose a consistency level other than QUORUM. I think in our case consistency is not necessarily an issue, since our data is write-once, read-many (immutable data). I suppose having a replication factor of 4 would result in two nodes in each datacenter having a copy of the data. If there's a flaw in my logic, please let me know : ] On Sun, Dec 12, 2010 at 2:04 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is my logic wrong here? Is there a way to ensure the nodes in the surviving datacenter respond successfully if the second datacenter is lost? Anyone have experience with this kind of problem? It's impossible to achieve consistency and availability at the same time. See: (Assuming partition tolerance.) Anyway, to expand a bit: the final consequence is that if you have a cluster that really does need QUORUM consistency, you won't be able to survive (in terms of availability, i.e., the cluster serving your traffic) a datacenter going down. If you want to continue operating in the case of a partition, you (1) cannot use QUORUM and (2) your application must be designed to work with, and survive seeing, inconsistent data. -- / Peter Schuller
Re: Memory leak with Sun Java 1.6 ?
On Dec 10, 2010, at 19:37, Peter Schuller wrote: To cargo cult it: Are you running a modern JVM? (Not e.g. openjdk b17 in lenny or some such.) If it is a JVM issue, ensuring you're using a reasonably recent JVM is probably much easier than starting to track it down... I had OOM problems with OpenJDK, switched to Sun/Oracle's recent 1.6.0_23 and... still have the same problem :-\ The stack trace always looks the same:

java.lang.OutOfMemoryError: Java heap space
  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:329)
  at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:261)
  at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
  at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
  at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
  at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
  at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:383)
  at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:393)
  at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:351)
  at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
  at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)

I'm writing from 1 client with 50 threads to a cluster of 4 machines (with hector). With both QUORUM and ONE, 2 machines quite reliably die with OOM before long. What may cause this? Won't Cassandra block/reject writes when a memtable is full and being flushed to disk, rather than keep growing and run out of memory if flushing can't keep up?
Re: Memory leak with Sun Java 1.6 ?
http://www.riptano.com/docs/0.6/troubleshooting/index#nodes-are-dying-with-oom-errors On Sun, Dec 12, 2010 at 9:52 AM, Timo Nentwig timo.nent...@toptarif.de wrote: On Dec 10, 2010, at 19:37, Peter Schuller wrote: To cargo cult it: Are you running a modern JVM? (Not e.g. openjdk b17 in lenny or some such.) If it is a JVM issue, ensuring you're using a reasonably recent JVM is probably much easier than starting to track it down... I had OOM problems with OpenJDK, switched to Sun/Oracle's recent 1.6.0_23 and... still have the same problem :-\ The stack trace always looks the same:

java.lang.OutOfMemoryError: Java heap space
  at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:329)
  at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:261)
  at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
  at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
  at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
  at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
  at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:383)
  at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:393)
  at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:351)
  at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
  at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:636)

I'm writing from 1 client with 50 threads to a cluster of 4 machines (with hector). With both QUORUM and ONE, 2 machines quite reliably die with OOM before long. What may cause this? Won't Cassandra block/reject writes when a memtable is full and being flushed to disk, rather than keep growing and run out of memory if flushing can't keep up? -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
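For anyone reading this in the archive: the linked page discusses, among other things, bounding memtable growth. If I recall the 0.6 layout correctly, those knobs live in storage-conf.xml; the values below are only the shipped defaults, not recommendations:

<MemtableThroughputInMB>64</MemtableThroughputInMB>
<MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
<MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

Worst-case heap use from memtables scales roughly with these thresholds times the number of column families (plus JVM object overhead), which is why a heavy write load against a 1-2GB heap can OOM even though writes "should" just flush.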
Dynamic Snitch / Read Path Questions
Hi again. It would be great if someone could comment on whether the following is true or not. I tried to understand the consequences of using -Dcassandra.dynamic_snitch=true for the read path, and this is what I came up with: 1) If using CL 1, the dynamic snitch will result in a data read from the node with the lowest latency (slightly simplified), even if the proxy node contains the data but has a higher latency than other possible nodes. This means it is not necessary to do load-based balancing on the client side. 2) If using CL > 1, the proxy node will always return the data itself, even when there is another node with less load. 3) Digest requests will be sent to all other living peer nodes for that key and will result in a data read on all nodes to calculate the digest. The only difference is that the data is not sent back, but IO-wise it is just as expensive. The next one goes a little further: we read/write with QUORUM and RF = 3. It seems to me that it wouldn't be hard to patch the StorageProxy to send only one read request and one digest request. Only if one of those requests fails would we have to query the remaining node. We don't need read repair, because we have to run repair once a week anyway and QUORUM guarantees consistency. This way we could reduce read load significantly, which should compensate for the latency increase from failed reads. Am I missing something? Best, Daniel
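For reference, the flag Daniel mentions is just a system property on the daemon's JVM; one way to set it (assuming the stock 0.6 startup scripts, where JVM_OPTS is defined in cassandra.in.sh) would be:

# Turn on the dynamic snitch; read routing then follows observed latencies.
JVM_OPTS="$JVM_OPTS -Dcassandra.dynamic_snitch=true"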
Re: Quorum and Datacenter loss
Thanks a lot Peter. So basically we would need to choose a consistency other than QUORUM. I think in our case consistency is not necessarily an issue since our data is write-once, read-many (immutable data). I suppose having a replication factor of 4 would result in two nodes in each datacenter having a copy of the data. If there's a flaw in my logic, please let me know : ] It would, but note that if you're writing at consistency level ONE only a single copy of the data is required to exist before your write is ACK:ed back to the client (but it will still be replicated). -- / Peter Schuller
iterate over all the rows with RP
Is the same connection required when iterating over all the rows with RandomPartitioner, or is it possible to use a different connection for each iteration? Shimi
Re: iterate over all the rows with RP
Is the same connection required when iterating over all the rows with RandomPartitioner, or is it possible to use a different connection for each iteration? In general, the choice of RPC connection (I assume you mean the underlying Thrift connection) does not affect the semantics of the RPC calls. -- / Peter Schuller
Re: iterate over all the rows with RP
So if I use a different connection (Thrift via Hector), will I get the same results? It makes sense when you use OPP, and I assume it is the same with RP. I just wanted to make sure this is the case and that there is no state being kept. Shimi On Sun, Dec 12, 2010 at 8:14 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is the same connection required when iterating over all the rows with RandomPartitioner, or is it possible to use a different connection for each iteration? In general, the choice of RPC connection (I assume you mean the underlying Thrift connection) does not affect the semantics of the RPC calls. -- / Peter Schuller
Re: N to N relationships
On Sun, Dec 12, 2010 at 3:20 AM, David Boxenhorn da...@lookin2.com wrote: You want to store every value twice? That would be a pain to maintain, and could possibly lead to inconsistent data. On Fri, Dec 10, 2010 at 3:50 AM, Nick Bailey n...@riptano.com wrote: I would also recommend two column families. Storing the key as NxN would require you to hit multiple machines to query for an entire row or column with RandomPartitioner. Even with OPP you would need to pick rows or columns to order by, and the other would require hitting multiple machines. Two column families avoids this and avoids any problems with choosing OPP. On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.com wrote: I'm assuming you have one matrix and you know the dimensions. Also, as you say, the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the Columns and one for the Rows. The key for each would be the col / row number, each Cassandra column name would be the id of the other dimension, and the value whatever you want. - when storing the data, update both the Column and Row CF - reading a whole row/col is then simply a read from the appropriate CF - reading an intersection is a get_slice to either the col or row CF using the column_names field to identify the other dimension. You would not need secondary indexes to serve these queries. Hope that helps. Aaron On 10 Dec, 2010, at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote: I mean if I have secondary indexes. Apparently they are calculated in the background... On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote: What do you mean by indexing? On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com wrote: Thanks a lot for the answer. What about the indexing when adding a new element? Is it incremental? Thanks again On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n? Then getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1. On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com wrote: Hello, For a specific case, we are thinking about representing an N to N relationship with an NxN matrix in Cassandra. The relations will exist only between a subset of elements, so the matrix will mostly contain empty elements. We have a set of questions concerning this: - what is the best way to represent this matrix? What would have the best performance in reading? In writing? . a super column family with n column families, with n columns each . a column family with n columns and n lines In the second case, we would need to extract 2 kinds of information: - all the relations for a line: this should be no specific problem; - all the relations for a column: in that case we would need an index for the columns, right? And then get all the lines where the value of the column in question is not null... Is that the correct way to do it? When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon Before secondary indexes, the only option was to store the data twice; yes, you have to maintain it yourself. The data model only provides fast searches on the key. An index is normally a separate entity with a different ordering, and it's almost the same here.
Re: OutOfMemory on count on cassandra 0.6.8 for large number of columns
Well, in this case I would say you probably need about 300MB of space in the heap, since that's what you've calculated. The APIs are designed to let you do what you think is best, and they definitely won't stop you from shooting yourself in the foot. Counting a huge row, or trying to grab every row in a large column family, are examples of this. Some of the clients try to protect you from this, but there is only so much that can be done without specific knowledge of the data, and get_count() is an example of this. While we're on the topic of large rows: if your row is essentially unbounded in size, you need to consider splitting it (one way to do that is sketched below). This is especially true if you stay with 0.6, where compactions of large rows can OOM you pretty easily. - Tyler On Sun, Dec 12, 2010 at 2:07 AM, Dave Martin moyesys...@googlemail.com wrote: Thanks Tyler. I was unaware of counters. The use case for column counts is really from an operational perspective, to allow a sysadmin to do ad hoc checks on columns to see if something has gone wrong in software outside of Cassandra. I think a cassandra-cli command such as count that makes Cassandra fall over is not ideal, unless we can say that for X columns Cassandra needs at least Y memory to remain stable. Cheers Dave On Sun, Dec 12, 2010 at 6:39 PM, Tyler Hobbs ty...@riptano.com wrote: Cassandra has to deserialize all of the columns in the row for get_count(), so from Cassandra's perspective it's almost as much work as getting the entire row; it just doesn't have to send everything back over the network. If you're frequently counting 8 million columns (or really, anything significant), you need to use counters instead. If this is a rare occurrence, you can do the count in multiple chunks by using a starting and ending column in the SlicePredicate for each chunk, but this requires some rough knowledge about the distribution of the column names in the row. - Tyler
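A minimal sketch of the row-splitting Tyler suggests; the bucket count and key scheme here are hypothetical, not from the thread. The idea is to spread the columns over a fixed number of sub-rows so that no single row grows without bound:

val bucketCount = 32

// Derive a stable sub-row key from the base key and the column name.
// Masking the sign bit keeps the hash non-negative before the modulo.
def bucketKey(baseKey: String, columnName: String): String =
  baseKey + ":" + ((columnName.hashCode & Int.MaxValue) % bucketCount)

// Writes go to bucketKey("myGUID", uuid) instead of "myGUID"; counting the
// logical row becomes the sum over all 32 bucket keys, each of which stays
// small enough to count (and compact) without blowing the heap.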
Re: iterate over all the rows with RP
This should be the case, yes: the semantics aren't affected by the connection, and no state is kept. What might happen, if you read/write with low consistency levels, is that when you hit a different host on the ring it might have an inconsistent state in the case of a partition. On Sunday, December 12, 2010, shimi shim...@gmail.com wrote: So if I use a different connection (Thrift via Hector), will I get the same results? It makes sense when you use OPP, and I assume it is the same with RP. I just wanted to make sure this is the case and that there is no state being kept. Shimi On Sun, Dec 12, 2010 at 8:14 PM, Peter Schuller peter.schul...@infidyne.com wrote: Is the same connection required when iterating over all the rows with RandomPartitioner, or is it possible to use a different connection for each iteration? In general, the choice of RPC connection (I assume you mean the underlying Thrift connection) does not affect the semantics of the RPC calls. -- / Peter Schuller -- /Ran
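To illustrate why no connection state is needed: under RandomPartitioner the whole iteration can be driven by the last key seen, which the client carries between calls. A rough, untested sketch against the 0.6 Thrift API (names assumed); any connection can serve any page:

import org.apache.cassandra.thrift.{Cassandra, ColumnParent, ConsistencyLevel, KeyRange, SlicePredicate, SliceRange}
import scala.collection.JavaConverters._

// Fetch one page of row keys, starting at startKey (inclusive).
def keyPage(client: Cassandra.Client, keyspace: String, cf: String,
            startKey: String, pageSize: Int): Seq[String] = {
  val range = new KeyRange(pageSize).setStart_key(startKey).setEnd_key("")
  // Ask for at most one column per row; only the keys matter here.
  val pred = new SlicePredicate().setSlice_range(
    new SliceRange(Array.empty[Byte], Array.empty[Byte], false, 1))
  client.get_range_slices(keyspace, new ColumnParent(cf), pred, range,
    ConsistencyLevel.ONE).asScala.map(_.getKey)
}

// Usage: begin with startKey = "", and for every page after the first drop
// its first key (start_key is inclusive, so it repeats the previous page's
// last key); stop when a page comes back smaller than pageSize. Since the
// cursor is just that last key, each call may use a different connection.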