Re: Geohash nearby query implementation in Cassandra.
2012/2/17 Raúl Raja Martínez raulr...@gmail.com Hello everyone, I'm working on an application that uses Cassandra and has a geolocation component. I was wondering, besides the slides and video at http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php that SimpleGeo published regarding their strategy, if anyone has implemented geohash storage and search in Cassandra. The basic usage is to allow a user to find things close to a geo location based on a distance radius. I thought about a couple of approaches. 1. Have the geohashes be the keys using the ordered partitioner and get a group of rows between keys, then store the items as columns in what would end up looking like wide rows, since each column would point to another row in a different column family representing the item nearby. That's what we did early on at SimpleGeo. 2. Simply store the geohash prefixes as columns and use secondary indexes to do queries such as >= and <=. This seems like a reasonable approach now that secondary indexes are available. It might even address some of the hotspot problems we had with the order preserving partitioner since the indices are distributed across all hosts. Of course there are tradeoffs there too. Seems like a viable option for sure. The problem I'm facing in both cases is ordering by distance and searching neighbors. This will always be a problem with dimensionality reduction techniques like geohashes. A brief bit of pedantry: it is mathematically impossible to do dimensionality reduction without losing information. You can't embed a 2-dimensional space in a 1-dimensional space and preserve the 2D topology. This manifests itself in all sorts of ways, but when it comes to doing kNN queries it's particularly obvious. Things that are near in 2D space can be far apart in 1D space and vice versa. Doing a 1D embedding like this will always result in suboptimal performance for at least some queries. You'll have to over-fetch and post-process to get the correct results. That said, a 1D embedding is certainly easier to code since multidimensional indexes are not available in Cassandra. And there are plenty of data sets that don't hit any degenerate cases. Moreover, if you're mostly doing bounding-radius queries the geohash approach isn't nearly as bad (the only trouble comes when you want to limit the results, in which case you often want things ordered by distance from the centroid and the query is no longer a bounding radius query - rather, it's a kNN with a radius constraint). In any case, geohash is a reasonable starting point, at least. The neighbors problem is clearly explained here: https://github.com/davetroy/geohash-js Once the neighbors are calculated an item can be fetched with SQL similar to this: SELECT * FROM table WHERE LEFT(geohash,6) IN ('dqcjqc', 'dqcjqf','dqcjqb','dqcjr1','dqcjq9','dqcjqd','dqcjr4','dqcjr0','dqcjq8') Since Cassandra does not currently support OR or an IN statement with elements that are not keys, I'm not sure what the best way to implement geohashes may be. Can't you use the thrift interface and use multiget_slice? If I recall correctly, we implemented a special version of multiget_slice that stopped when we got a certain number of columns across all rows. I don't have that code handy but we did that work early in our Cassandra careers and, starting from the thrift interface and following control flow for the multiget_slice command, it wasn't terribly difficult to add. Mike
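For what it's worth, a rough sketch of what that multiget_slice call might look like against the 0.7-and-later Thrift bindings. The Geo keyspace and places_by_geohash column family are made up for illustration; the row keys are the six-character neighbor cells from the example above, and each column is assumed to point at an item row elsewhere.

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class GeohashNeighborQuery {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Geo");

        // The cell containing the query point plus its eight neighbors,
        // as computed by a geohash library (values from the example above).
        String[] cells = { "dqcjqc", "dqcjqf", "dqcjqb", "dqcjr1", "dqcjq9",
                           "dqcjqd", "dqcjr4", "dqcjr0", "dqcjq8" };
        List<ByteBuffer> keys = new ArrayList<ByteBuffer>();
        for (String cell : cells)
            keys.add(ByteBuffer.wrap(cell.getBytes("UTF-8")));

        // Pull up to 200 item pointers out of each neighbor row in one round trip.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 200));

        Map<ByteBuffer, List<ColumnOrSuperColumn>> rows = client.multiget_slice(
                keys, new ColumnParent("places_by_geohash"), predicate, ConsistencyLevel.ONE);

        // The 1D embedding over-fetches by design: compute real distances
        // client-side, then sort and truncate to get the k nearest items.
        System.out.println("candidate rows fetched: " + rows.size());
        transport.close();
    }
}

Note the count here is per row; stopping once a total column count across all rows is reached is the custom multiget_slice behavior mentioned above, which the stock API doesn't do.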
Re: Write everywhere, read anywhere
2011/8/3 Patricio Echagüe patric...@gmail.com On Wed, Aug 3, 2011 at 4:00 PM, Philippe watche...@gmail.com wrote: Hello, I have a 3-node, RF=3, cluster configured to write at CL.ALL and read at CL.ONE. When I take one of the nodes down, writes fail, which is what I expect. When I run a repair, I see data being streamed from those column families... that I didn't expect. How can the nodes diverge? Does this mean that reading at CL.ONE may return inconsistent data? We abort the mutation beforehand when there are not enough replicas alive. If a mutation went through and in the middle of it a replica goes down, you can write to some nodes and the request will time out. In that case CL.ONE may return inconsistent data. Doesn't CL.QUORUM suffer from the same problem? There's no isolation or rollback with CL.QUORUM either. So if I do a quorum write with RF=3 and it fails after hitting a single node, a subsequent quorum read could return the old data (if it hits the two nodes that didn't receive the write) or the new data that failed mid-write (if it hits the node that did receive the write). Basically, the scenarios where CL.ALL + CL.ONE results in a read of inconsistent data could also cause a CL.QUORUM write followed by a CL.QUORUM read to return inconsistent data. Right? The problem (if there is one) is that even in the quorum case columns with the most recent timestamp win during repair resolution, not columns that have quorum consensus. Mike
Re: Write everywhere, read anywhere
On Thu, Aug 4, 2011 at 10:25 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: If you have RF=3 quorum won’t fail with one node down. So R/W quorum will be consistent in the case of one node down. If two nodes go down at the same time, then you can get inconsistent data from quorum write/read if the write fails with TimeOut, the nodes come back up, and then read asks the two nodes that were down what the value is. And another read asks the node that was up, and a node that was down. Those two reads will get different answers. So the short answer is: yea, same thing can happen with quorum... It's true that the failure scenarios are slightly different, but it's not entirely true that two nodes need to fail to trigger inconsistencies with quorum. A single node could be partitioned and produce the same result. If a network event occurs on a single host then any writes that came in before the event, that are processed before phi evict kicks in and marks the rest of the cluster unavailable, will be written locally. From the rest of the cluster's perspective only one node failed, but from that node's perspective the entire rest of the cluster failed. Obviously, similar things could happen with DC_QUORUM if a datacenter went offline. Mike
Re: b-tree
On Fri, Jul 22, 2011 at 12:05 AM, Eldad Yamin elda...@gmail.com wrote: In order to split the nodes: SimpleGeo has a max of 1,000 records (i.e. places) on each node in the tree; if the number reaches 1,000 they split the node. To avoid more than one process editing/splitting the node, a transaction is needed. You don't need a transaction, you just need consensus and/or idempotence. In this case both can be achieved fairly easily. Mike On Jul 22, 2011 1:01 AM, aaron morton aa...@thelastpickle.com wrote: But how will you be able to maintain it while it evolves and new data is added without transactions? What is the situation you think you need transactions for? Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 22 Jul 2011, at 00:06, Eldad Yamin wrote: Aaron, Nested set is exactly what I had in mind. But how will you be able to maintain it while it evolves and new data is added without transactions? Thanks! On Thu, Jul 21, 2011 at 1:44 AM, aaron morton aa...@thelastpickle.com wrote: Just throwing out a (half baked) idea, perhaps the Nested Set Model of trees would work http://en.wikipedia.org/wiki/Nested_set_model * Every row would represent a set with a left and right encoded into the key * Members are inserted as columns into *every* set / row they are a member of. So we are de-normalising and trading space for time. * May need to maintain a custom secondary index of the materialised sets, e.g. slice a row to get the first column >= the left value you are interested in; that is the key for the set. I've not thought it through much further than that, a lot would depend on your data. The top sets may get very big. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 21 Jul 2011, at 08:33, Jeffrey Kesselman wrote: I'm not sure if I have an answer for you, anyway, but I'm curious. A b-tree and a binary tree are not the same thing. A binary tree is a basic fundamental data structure; a b-tree is an approach to storing and indexing data on disk for a database. Which do you mean? On Wed, Jul 20, 2011 at 4:30 PM, Eldad Yamin elda...@gmail.com wrote: Hello, Is there any good way of storing a binary tree in Cassandra? I wonder if someone has already implemented something like that, and how they accomplished it without transaction support (while the tree keeps evolving)? I'm asking because I want to save geospatial data, and SimpleGeo did it using a b-tree: http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php Thanks! -- It's always darkest just before you are eaten by a grue.
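To make Aaron's half-baked sketch slightly more concrete, here's roughly what the denormalised insert could look like if each set is a row keyed by its "left:right" bounds. The Sets column family, the enclosingSets helper and the key values are all made up; the point is that adding a member is just a handful of idempotent column inserts (0.7-era Thrift bindings), no transaction required.

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;

public class NestedSetInsert {
    // Stand-in for "walk up the tree": return the row keys ("left:right") of every
    // set that encloses the new member's position in the nested set numbering.
    static List<String> enclosingSets(long memberLeft) {
        return Arrays.asList("1:1000", "10:400", "50:120");
    }

    static void addMember(Cassandra.Client client, String memberId, long memberLeft) throws Exception {
        for (String setKey : enclosingSets(memberLeft)) {
            // One column per enclosing set row; re-running this is harmless because
            // the write is idempotent (same name, same value, newer timestamp).
            Column col = new Column();
            col.setName(ByteBuffer.wrap(memberId.getBytes("UTF-8")));
            col.setValue(ByteBuffer.allocate(0));
            col.setTimestamp(System.currentTimeMillis() * 1000);
            client.insert(ByteBuffer.wrap(setKey.getBytes("UTF-8")),
                          new ColumnParent("Sets"), col, ConsistencyLevel.QUORUM);
        }
    }
}

Splitting a full set is the trickier part; as long as the split picks the new bounds deterministically, two processes racing to perform it just end up writing the same columns twice.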
Re: Commitlog Disk Full
Just noticed this thread and figured I'd chime in since we've had similar issues with the commit log growing too large on our clusters. Tuning down the flush timeout wasn't really an acceptable solution for us since we didn't want to be constantly flushing and generating extra SSTables for no reason. So we wrote a small tool that we start in a static block in CassandraServer that periodically checks the commit log size and flushes all memtables if they're above some threshold. I've attached that code. Any feedback / improvements are more than welcome. Mike On Thu, May 12, 2011 at 11:30 AM, Sanjeev Kulkarni sanj...@locomatix.com wrote: Hey guys, I have an ec2 Debian cluster consisting of several nodes running 0.7.5 on ephemeral disks. These are fresh installs and not upgrades. The commitlog is set to the smaller of the disks, which is around 10G in size, and the datadir is set to the bigger disk. The config file is basically the same as the one supplied by the default installation. Our applications write to the cluster. After about a day of writing we started noticing the commitlog disk filling up. Soon we went over the disk limit and writes started failing. At this point we stopped the cluster. Over the course of the day we inserted around 25G of data. Our column values are pretty small. I understand that Cassandra periodically cleans up the commitlog directories by generating sstables in the datadir. Is there any way to speed up this movement from commitlog to datadir? Thanks! [Attachment: PeriodicMemtableFlusher.java]
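The attached tool hooks into the server directly and isn't reproduced here; a crude external equivalent of the same idea (watch the commit log directory, force a flush when it gets too big) might look something like the sketch below. The paths and threshold are placeholders, and whether a bare nodetool flush flushes every keyspace depends on the Cassandra version, so you may need to pass an explicit keyspace argument.

import java.io.File;

public class CommitLogWatcher {
    static long dirSize(File dir) {
        long total = 0;
        File[] files = dir.listFiles();
        if (files == null) return 0;
        for (File f : files)
            total += f.isDirectory() ? dirSize(f) : f.length();
        return total;
    }

    public static void main(String[] args) throws Exception {
        File commitLogDir = new File(args.length > 0 ? args[0] : "/var/lib/cassandra/commitlog");
        long thresholdBytes = 4L * 1024 * 1024 * 1024; // 4 GB; tune to your commit log disk
        while (true) {
            if (dirSize(commitLogDir) > thresholdBytes) {
                // Flushing memtables lets Cassandra recycle the commit log segments
                // that were only being kept around for those memtables.
                new ProcessBuilder("nodetool", "--host", "localhost", "flush")
                        .start().waitFor();
            }
            Thread.sleep(60000);
        }
    }
}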
Re: Do supercolumns have a purpose?
On Tue, Feb 8, 2011 at 2:03 AM, David Boxenhorn da...@lookin2.com wrote: Shaun, I agree with you, but marking them as deprecated is not good enough for me. I can't easily stop using supercolumns. I need an upgrade path. David, Cassandra is open source and community developed. The right thing to do is what's best for the community, which sometimes conflicts with what's best for individual users. Such strife should be minimized, but it will never be eliminated. Luckily, because this is an open source, liberally licensed project, if you feel strongly about something you should feel free to add whatever features you want yourself. I'm sure other people in your situation will thank you for it. At a minimum I think it would behoove you to re-read some of the comments here re: why super columns aren't really needed and take another look at your data model and code. I would actually be quite surprised to find a use of super columns that could not be trivially converted to normal columns. In fact, it should be possible to do at the framework/client library layer - you probably wouldn't even need to change any application code. Mike On Tue, Feb 8, 2011 at 3:53 AM, Shaun Cutts sh...@cuttshome.net wrote: I'm a newbie here, but, with apologies for my presumptuousness, I think you should deprecate SuperColumns. They are already distracting you, and as the years go by the cost of supporting them as you add more and more functionality is only likely to get worse. It would be better to concentrate on making the core column families better (and I'm sure we can all think of lots of things we'd like). Just dropping SuperColumns would be bad for your reputation -- and for users like David who are currently using them. But if you mark them clearly as deprecated and explain why and what to do instead (perhaps putting a bit of effort into migration tools... or even a virtual layer supporting arbitrary hierarchical data), then you can drop them in a few years (when you get to 1.0, say), without people feeling betrayed. -- Shaun On Feb 6, 2011, at 3:48 AM, David Boxenhorn wrote: My main point was to say that I think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then let me say what I want: I want supercolumn families to have any feature that regular column families have. My data model is full of supercolumns. I used them, even though I knew I didn't *have to*, because they were there, which implied to me that I was supposed to use them for some good reason. Now I suspect that they will gradually become less and less functional, as features are added to regular column families and not supported for supercolumn families. On Fri, Feb 4, 2011 at 10:58 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Fri, Feb 4, 2011 at 12:35 AM, Mike Malone m...@simplegeo.com wrote: On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.com wrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. 
I realize that this is largely subjective, and on such matters code speaks louder than words, but I don't think I agree with you on the issue of which alternative is less work, or even which is a better solution. You are right, I probably put too much emphasis on that sentence. My main point was to say that I think it is better to create tickets for what you want, rather than for something else completely different that would, as a by-product, give you what you want. Then I suspect that *if* the only goal is to get secondary indexes on super columns, then there is a good chance this would be less work than getting rid of super columns. But to be fair, secondary indexes on super columns may not make too much sense without #598, which itself would require quite some work, so clearly I spoke a bit quickly. If the goal is to have a hierarchical model, limiting the depth to two seems arbitrary. Why not go all the way and allow an arbitrarily deep hierarchy? If a more sophisticated hierarchical model is deemed unnecessary, or impractical, allowing a depth of two seems inconsistent and unnecessary. It's pretty trivial to overlay a hierarchical model on top of the map-of-sorted-maps model that Cassandra implements. Ed Anuff has implemented a custom comparator that does the job [1]. Google's Megastore has a similar architecture and goes even further [2]. It seems to me that super columns are a historical artifact from Cassandra's early life as Facebook's inbox storage system. They needed posting lists of messages, sharded
Re: postgis cassandra?
It's not really the storage of spatial data that's tricky. We use geojson as a wire-line format at the higher levels of our system (e.g., the HTTP API). But the hard part is organizing the data for efficient retrieval and keeping those indices consistent with the data being indexed. Efficient multi-dimensional indexing is tricky, but that's what you'll need if you want to support generic spatial querying (overlaps, contains, interacts, nearest neighbor, etc). On Sun, Feb 6, 2011 at 1:14 PM, Aaron Morton aa...@thelastpickle.com wrote: Here is a recent presentation from simplegeo.com that may provide some inspiration http://strangeloop2010.com/system/talks/presentations/000/014/495/Malone-DimensionalDataDHT.pdf Can you provide some more details on the data you want to store and queries you want to run? Aaron On 6/02/2011, at 7:04 AM, Sean Ochoa sean.m.oc...@gmail.com wrote: That's a good question, Bill. The data that I'm trying to store begins as a simple point. But, moving forward, it will become more like complex geometries. I assume that I can simply create a JSON-like object and insert it. Which, for now, works. I'm just wondering if there's a typical / publicly accepted standard of storing somewhat complex spatial data in Cassandra. Additionally, I would like to figure out how one goes about slicing on large spatial data sets given situations where, for instance, I would like to get all the points in a column-family where the point is within a shape. I guess it boils down to using a spatial comparator of some sort, but I haven't seen one, yet. - Sean On Sat, Feb 5, 2011 at 9:51 AM, William R Speirs bill.spe...@gmail.com wrote: I know nothing about postgis and little about spatial data, but if you're simply talking about data that relates to some latitude longitude pair, you could have your row key simply be the concatenation of the two: lat:long. Can you provide more details about the type of data you're looking to store? Thanks... Bill- On 02/05/2011 12:22 PM, Sean Ochoa wrote: Can someone tell me how to represent spatial data (coming from postgis) in Cassandra? - Sean -- Sean | M (206) 962-7954 | GV (760) 624-8718
Re: Do supercolumns have a purpose?
On Thu, Feb 3, 2011 at 6:44 AM, Sylvain Lebresne sylv...@datastax.comwrote: On Thu, Feb 3, 2011 at 3:00 PM, David Boxenhorn da...@lookin2.com wrote: The advantage would be to enable secondary indexes on supercolumn families. Then I suggest opening a ticket for adding secondary indexes to supercolumn families and voting on it. This will be 1 or 2 order of magnitude less work than getting rid of super column internally, and probably a much better solution anyway. I realize that this is largely subjective, and on such matters code speaks louder than words, but I don't think I agree with you on the issue of which alternative is less work, or even which is a better solution. If the goal is to have a hierarchical model, limiting the depth to two seems arbitrary. Why not go all the way and allow an arbitrarily deep hierarchy? If a more sophisticated hierarchical model is deemed unnecessary, or impractical, allowing a depth of two seems inconsistent and unnecessary. It's pretty trivial to overlay a hierarchical model on top of the map-of-sorted-maps model that Cassandra implements. Ed Anuff has implemented a custom comparator that does the job [1]. Google's Megastore has a similar architecture and goes even further [2]. It seems to me that super columns are a historical artifact from Cassandra's early life as Facebook's inbox storage system. They needed posting lists of messages, sharded by user. So that's what they built. In my dealings with the Cassandra code, super columns end up making a mess all over the place when algorithms need to be special cased and branch based on the column/supercolumn distinction. I won't even mention what it does to the thrift interface. Mike [1] http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html [2] http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf
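For anyone wondering what "overlaying a hierarchical model" looks like from the client side, the simplest version is just packing the name parts into one column name. A minimal sketch; the ':' delimiter and the inbox/message names are arbitrary choices, and Ed's comparator [1] is the more robust way to do this:

import java.nio.ByteBuffer;

public class PackedColumnName {
    // Client-side stand-in for a super column: pack the "super" and "sub" parts
    // into one column name. Under a UTF8Type/BytesType comparator every sub-column
    // of a given super part sorts together, so a slice from "inbox:" to "inbox;"
    // (';' is the next byte after ':') returns exactly that group.
    static ByteBuffer packedName(String superPart, String subPart) throws Exception {
        // ':' is an arbitrary delimiter and must not occur in superPart.
        return ByteBuffer.wrap((superPart + ":" + subPart).getBytes("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(new String(packedName("inbox", "message-42").array(), "UTF-8"));
    }
}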
Re: GeoIndexing in Cassandra, Open Sourced?
A more recent preso I gave about the SimpleGeo architecture is up at http://strangeloop2010.com/system/talks/presentations/000/014/495/Malone-DimensionalDataDHT.pdf Mike On Fri, Jan 21, 2011 at 10:02 AM, Joseph Stein crypt...@gmail.com wrote: I hear that a bunch of folks have GeoIndexing built on top of Cassandra and running in production. Any of them open sourced (Twitter? SimpleGeo? Bueller?) planning on it? /* Joe Stein http://www.linkedin.com/in/charmalloc Twitter: @allthingshadoop */
Re: cassandra row cache
Digest reads could be being dropped..? On Thu, Jan 13, 2011 at 4:11 PM, Jonathan Ellis jbel...@gmail.com wrote: On Thu, Jan 13, 2011 at 2:00 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Is it possible that your are reading at READ.ONE and that READ.ONE only warms cache on 1 of your three nodes= 20. 2nd read warms another 60%, and by the third read all the replicas are warm? 99% ? This would be true if digest reads were not warming caches. Digest reads do go through the cache path. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Cassandra freezes under load when using libc6 2.11.1-0ubuntu7.5
Hey folks, We've discovered an issue on Ubuntu/Lenny with libc6 2.11.1-0ubuntu7.5 (it may also affect versions between 2.11.1-0ubuntu7.1 and 2.11.1-0ubuntu7.4). The bug affects systems when a large number of threads (or processes) are created rapidly. Once triggered, the system will become completely unresponsive for ten to fifteen minutes. We've seen this issue on our production Cassandra clusters under high load. Cassandra seems particularly susceptible to this issue because of the large thread pools that it creates. In particular, we suspect the unbounded thread pool for connection management may be pushing some systems over the edge. We're still trying to narrow down what changed in libc that is causing this issue. We also haven't tested things outside of xen, or on non-x86 architectures. But if you're seeing these symptoms, you may want to try upgrading libc6. I'll send out an update if we find anything else interesting. If anyone has any thoughts as to what the cause is, we're all ears! Hope this saves someone some heart-ache, Mike
Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
Hey Takayuki, I don't think you're going to find anyone willing to promise that Cassandra will fit your petabyte scale data analysis problem. That's a lot of data, and there's not a ton of operational experience at that scale within the community. And the people who do work on that sort of problem tend to be busy ;). If your problem is that big, you're probably going to need to do some experimentation and see if the system will scale for you. I'm sure someone here can answer any specific questions that may come up if you do that sort of work. As you mentioned, the first concern I'd have with a cluster that big is whether gossip will scale. I'd suggest taking a look at the gossip code. Cassandra nodes are omniscient in the sense that they all try to maintain full ring state for the entire cluster. At a certain cluster size that no longer works. My best guess is that a cluster of 1000 machines would be fine. Maybe even an order of magnitude bigger than that. I could be completely wrong, but given the low overhead that I've observed that estimate seems reasonable. If you do find that gossip won't work in your situation it would be interesting to hear why. You may even consider modifying / updating gossip to work for you. The code isn't as scary as it may seem. At that scale it's likely you'll encounter bugs and corner cases that other people haven't, so it's probably worth familiarizing yourself with the code anyways if you decide to use Cassandra. Mike On Tue, Oct 26, 2010 at 1:09 AM, Takayuki Tsunakawa tsunakawa.ta...@jp.fujitsu.com wrote: Hello, Edward, Thank you for giving me insight about large disk nodes. From: Edward Capriolo edlinuxg...@gmail.com Index sampling on start up. If you have very small rows your indexes become large. These have to be sampled on start up and sampling our indexes for 300Gb of data can take 5 minutes. This is going to be optimized soon. 5 minutes for 300 GB of data ... it's not cheap, is it? Simply, 3 TB of data will lead to 50 minutes just for computing input splits. This is too expensive when I want only part of the 3 TB of data. (Just wanted to note some of this as I am in the middle of a process of joining a node now :) Good luck. I'd appreciate it if you could share some performance numbers for joining nodes (amount of data, time to distribute data, load impact on applications, etc) if you can. The cluster our customer is thinking of is likely to become very large, so I'm interested in the elasticity. Yahoo!'s YCSB report makes me worry about adding nodes. Regards, Takayuki Tsunakawa From: Edward Capriolo edlinuxg...@gmail.com [Q3] There are some challenges with very large disk nodes. Caveats: I will use words like long, slow, and large relatively. If you have great equipment, i.e. 10G Ethernet between nodes, it will not take long to transfer data. If you have an insane disk pack it may not take long to compact 200GB of data. I am basing these statements on server class hardware: ~32 GB RAM, ~2x processor, ~6 disk SAS RAID. Index sampling on start up. If you have very small rows your indexes become large. These have to be sampled on start up and sampling our indexes for 300Gb of data can take 5 minutes. This is going to be optimized soon. Joining nodes: When you go with larger systems joining a new node involves a lot of transfer, and can take a long time. The node join process is going to be optimized in 0.7 and 0.8 (quite drastic changes in 0.7). Major compaction and very large normal compaction can take a long time. 
For example while doing a 200 GB compaction that takes 30 minutes, other sstables build up, more sstables mean slower reads. Achieving a high RAM/DISK ratio may be easier with smaller nodes vs one big node with 128 GB RAM $$$. As Jonathan pointed out nothing technically is stopping larger disk nodes. (Just wanted to note some of this as I am in the middle of a process of joining a node now :)
Re: what causes MESSAGE-DESERIALIZER-POOL to spike
This may be your problem: https://issues.apache.org/jira/browse/CASSANDRA-1358 The message deserializer executor is being created with a core pool size of 1. Since it uses a queue with unbounded capacity new requests are always queued and the thread pool never grows. So the message deserializer becomes a single-threaded bottleneck through which all traffic must pass. So your 16 cores are reduced to one core for handling all inter-node communication (and any intra-node communication that's being passed through the messaging service). Mike On Tue, Aug 3, 2010 at 10:02 PM, Dathan Pattishall datha...@gmail.comwrote: The output of htop shows threads as procs with a breakdown of how much cpu /etc per thread (in ncurses color!). All of these Java procs are just Java threads of only 1 instance of Cassandra per Server. On Sat, Jul 31, 2010 at 3:45 PM, Benjamin Black b...@b3k.us wrote: Sorry, I just noticed: are you running 14 instances of Cassandra on a single physical machine or are all those java processes something else? On Mon, Jul 26, 2010 at 12:22 PM, Dathan Pattishall datha...@gmail.com wrote: I have 4 nodes on enterprise type hardware (Lots of Ram 12GB, 16 i7 cores, RAID Disks). ~# /opt/cassandra/bin/nodetool --host=localhost --port=8181 tpstats Pool NameActive Pending Completed STREAM-STAGE 0 0 0 RESPONSE-STAGE0 0 516280 ROW-READ-STAGE8 40961164326 LB-OPERATIONS 0 0 0 MESSAGE-DESERIALIZER-POOL 16820081818682 GMFD 0 0 6467 LB-TARGET 0 0 0 CONSISTENCY-MANAGER 0 0 661477 ROW-MUTATION-STAGE0 0 998780 MESSAGE-STREAMING-POOL0 0 0 LOAD-BALANCER-STAGE 0 0 0 FLUSH-SORTER-POOL 0 0 0 MEMTABLE-POST-FLUSHER 0 0 4 FLUSH-WRITER-POOL 0 0 4 AE-SERVICE-STAGE 0 0 0 HINTED-HANDOFF-POOL 0 0 3 EQX r...@cass04:~# vmstat -n 1 procs ---memory-- ---swap-- -io --system-- -cpu-- r b swpd free buff cache si sobibo in cs us sy id wa st 6 10 7096 121816 16244 1037549200 1 300 5 1 94 0 0 2 10 7096 116484 16248 1038114400 5636 4 21210 9820 2 1 79 18 0 1 9 7096 108920 16248 1038759200 6216 0 21439 9878 2 1 81 16 0 0 9 7096 129108 16248 1036485200 6024 0 23280 8753 2 1 80 17 0 2 9 7096 122460 16248 1037090800 6072 0 20835 9461 2 1 83 14 0 2 8 7096 115740 16260 1037575200 5168 292 21049 9511 3 1 77 20 0 1 10 7096 108424 16260 1038230000 6244 0 21483 8981 2 1 75 22 0 3 8 7096 125028 16260 1036410400 5584 0 21238 9436 2 1 81 16 0 3 9 7096 117928 16260 1037006400 5988 0 21505 10225 2 1 77 19 0 1 8 7096 109544 16260 1037664000 634028 20840 8602 2 1 80 18 0 0 9 7096 127028 16240 1035765200 5984 0 20853 9158 2 1 79 18 0 9 0 7096 121472 16240 1036349200 5716 0 20520 8489 1 1 82 16 0 3 9 7096 112668 16240 1036987200 6404 0 21314 9459 2 1 84 13 0 1 9 7096 127300 16236 1035344000 5684 0 38914 10068 2 1 76 21 0 But the 16 cores are hardly utilized. Which indicates to me there is some bad thread thrashing, but why? 1 [| 8.3%] Tasks: 1070 total, 1 running 2 [0.0%] Load average: 8.34 9.05 8.82 3 [0.0%] Uptime: 192 days(!), 15:29:52 4 [|||17.9%] 5 [| 5.7%] 6 [|| 1.3%] 7 [|| 2.6%] 8 [| 0.6%] 9 [| 0.6%] 10 [|| 1.9%] 11 [|| 1.9%] 12 [|| 1.9%] 13 [|| 1.3%] 14 [| 0.6%] 15 [||
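The Java behaviour behind that single-threaded bottleneck is easy to reproduce in isolation: a ThreadPoolExecutor with a core size of 1 and an unbounded work queue never grows past one thread, no matter what the maximum pool size says, because extra threads are only created when the queue rejects a task. A self-contained demo (nothing Cassandra-specific here):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DeserializerPoolDemo {
    public static void main(String[] args) throws Exception {
        // Core pool size 1, max 16, but an unbounded queue: the pool only grows past
        // the core size when the queue rejects a task, and an unbounded
        // LinkedBlockingQueue never does. Result: one worker thread, ever.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 16, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<Runnable>());
        for (int i = 0; i < 1000; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    try { Thread.sleep(10); } catch (InterruptedException e) { }
                }
            });
        }
        Thread.sleep(500);
        System.out.println("pool size: " + pool.getPoolSize()); // prints 1
        pool.shutdown();
    }
}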
Re: what causes MESSAGE-DESERIALIZER-POOL to spike
So after 4096 messages get pushed on the row-read-stage queue (or any other multiThreadedStage) the deserializer basically becomes a single-threaded blocking queue that prevents any other inter-node RPC from occurring..? Sounds like it's a problem either way. If the row read stage is what's backed up, why not have the messages stack up on that stage? Mike On Wed, Aug 4, 2010 at 11:46 AM, Jonathan Ellis jbel...@gmail.com wrote: No, MDP is backing up because Row-Read-Stage [the stage after MDP on reads] is full at 4096, meaning you're not able to process reads as quickly as the requests are coming in. On Wed, Aug 4, 2010 at 2:21 PM, Mike Malone m...@simplegeo.com wrote: This may be your problem: https://issues.apache.org/jira/browse/CASSANDRA-1358 The message deserializer executor is being created with a core pool size of 1. Since it uses a queue with unbounded capacity new requests are always queued and the thread pool never grows. So the message deserializer becomes a single-threaded bottleneck through which all traffic must pass. So your 16 cores are reduced to one core for handling all inter-node communication (and any intra-node communication that's being passed through the messaging service). Mike On Tue, Aug 3, 2010 at 10:02 PM, Dathan Pattishall datha...@gmail.com wrote: The output of htop shows threads as procs with a breakdown of how much cpu /etc per thread (in ncurses color!). All of these Java procs are just Java threads of only 1 instance of Cassandra per Server. On Sat, Jul 31, 2010 at 3:45 PM, Benjamin Black b...@b3k.us wrote: Sorry, I just noticed: are you running 14 instances of Cassandra on a single physical machine or are all those java processes something else? On Mon, Jul 26, 2010 at 12:22 PM, Dathan Pattishall datha...@gmail.com wrote: I have 4 nodes on enterprise type hardware (Lots of Ram 12GB, 16 i7 cores, RAID Disks). ~# /opt/cassandra/bin/nodetool --host=localhost --port=8181 tpstats Pool NameActive Pending Completed STREAM-STAGE 0 0 0 RESPONSE-STAGE0 0 516280 ROW-READ-STAGE8 40961164326 LB-OPERATIONS 0 0 0 MESSAGE-DESERIALIZER-POOL 16820081818682 GMFD 0 0 6467 LB-TARGET 0 0 0 CONSISTENCY-MANAGER 0 0 661477 ROW-MUTATION-STAGE0 0 998780 MESSAGE-STREAMING-POOL0 0 0 LOAD-BALANCER-STAGE 0 0 0 FLUSH-SORTER-POOL 0 0 0 MEMTABLE-POST-FLUSHER 0 0 4 FLUSH-WRITER-POOL 0 0 4 AE-SERVICE-STAGE 0 0 0 HINTED-HANDOFF-POOL 0 0 3 EQX r...@cass04:~# vmstat -n 1 procs ---memory-- ---swap-- -io --system-- -cpu-- r b swpd free buff cache si sobibo in cs us sy id wa st 6 10 7096 121816 16244 1037549200 1 300 5 1 94 0 0 2 10 7096 116484 16248 1038114400 5636 4 21210 9820 2 1 79 18 0 1 9 7096 108920 16248 1038759200 6216 0 21439 9878 2 1 81 16 0 0 9 7096 129108 16248 1036485200 6024 0 23280 8753 2 1 80 17 0 2 9 7096 122460 16248 1037090800 6072 0 20835 9461 2 1 83 14 0 2 8 7096 115740 16260 1037575200 5168 292 21049 9511 3 1 77 20 0 1 10 7096 108424 16260 1038230000 6244 0 21483 8981 2 1 75 22 0 3 8 7096 125028 16260 1036410400 5584 0 21238 9436 2 1 81 16 0 3 9 7096 117928 16260 1037006400 5988 0 21505 10225 2 1 77 19 0 1 8 7096 109544 16260 1037664000 634028 20840 8602 2 1 80 18 0 0 9 7096 127028 16240 1035765200 5984 0 20853 9158 2 1 79 18 0 9 0 7096 121472 16240 1036349200 5716 0 20520 8489 1 1 82 16 0 3 9 7096 112668 16240 1036987200 6404 0 21314 9459 2 1 84 13 0 1 9 7096 127300 16236 1035344000 5684 0 38914 10068 2 1 76 21 0 But the 16 cores are hardly utilized. Which indicates to me there is some bad thread thrashing, but why? 
Re: get_range_slices
I think the answer to your question is no, you shouldn't. I'm feeling far too lazy to do even light research on the topic, but I remember there being a bug where replicas weren't consolidated and you'd get a result set that included data from each replica that was consulted for a query. That could be what you're seeing. Are you running the most recent release? Try dropping to CL.ONE and see if you only get one copy. If that fixes it, I'd suggest searching JIRA. Mike On Thu, Jul 8, 2010 at 6:40 PM, Jonathan Shook jsh...@gmail.com wrote: Should I ever expect multiples of the same key (with non-empty column sets) from the same get_range_slices call? I've verified that the column data is identical byte-for-byte, as well, including column timestamps.
Re: Coke Products at Digg?
On Wed, Jul 7, 2010 at 8:17 AM, Eric Evans eev...@rackspace.com wrote: I heard a rumor that Digg was moving away from Coca-Cola products in all of its vending machines and break rooms. Can anyone from Digg comment on this? My near-term beverage consumption strategy is based largely on my understanding of Digg's, so if there has been a change, I may need to reevaluate. Not sure about Digg, but I heard Twitter is switching over to Fanta. It's been adopted by Coke so it must be fairly stable. There's not as much flexibility in the product lineup, but what they do offer is extremely delicious. Just my $0.02. Mike
Re: Coke Products at Digg?
On Wed, Jul 7, 2010 at 8:55 AM, Miguel Verde miguelitov...@gmail.comwrote: Dr. Pepper has recently been picked up by Coca Cola as well. I wonder if the UnCola solutions like 7Up and Fanta are just a fad? I'm on the fence. I mean, there's really nothing wrong with a nice cold Coke to satiate your thirst. But we've all been drinking cola-flavored beverages for so long I think they've become a hammer, so to speak. Can't hurt to shake things up a bit. Let's be real here: if you're thirsty, you should be drinking water. Coffee or teas are more effective at delivering caffeine. And who wants to sit down to a big steak dinner with a glass of Cola? A nice red wine is a much better tool for the job. Horses for courses, that's my take. Seems to me the carbonated beverage manufacturers are just starting to realize that they can flavor their drinks with something other than the cola-blend that Angelo Mariani invented in 1863! Mike On Wed, Jul 7, 2010 at 10:50 AM, Mike Malone m...@simplegeo.com wrote: On Wed, Jul 7, 2010 at 8:17 AM, Eric Evans eev...@rackspace.com wrote: I heard a rumor that Digg was moving away from Coca-Cola products in all of its vending machines and break rooms. Can anyone from Digg comment on this? My near-term beverage consumption strategy is based largely on my understanding of Digg's, so if there has been a change, I may need to reevaluate. Not sure about Digg, but I heard Twitter is switching over to Fanta. It's been adopted by Coke so it must be fairly stable. There's not as much flexibility in the product lineup, but what they do offer is extremely delicious. Just my $0.02. Mike
Re: Cassandra and Thrift on the Server Side
Still, to Clint's point, everyone knows how to make an HTTP request. If you want a Cassandra client running on, let's say, an iPhone for some reason, a REST API is going to be a lot more straightforward to implement. There's no reason an HTTP service would have to live inside the Cassandra project though, right... we're just talking about a proxy that translates from one protocol (HTTP) to another (thrift / avro). Shouldn't be too hard to implement. It could even be open sourced, and referenced from the Cassandra website, maybe even endorsed by the Cassandra project. High level though, I think it's important to resist the temptation to build things in that could just as easily live separately and develop orthogonally. I feel the same way about access control... I think it's more natural and flexible for that to be handled in an application rather than in the database... If your particular requirements end up pushing access control back to the data store tier then it should be fairly easy to wrap the Cassandra service at either the Java level (by subclassing) or the OS level (by having Cassandra listen only on localhost and having an authenticating / authorizing proxy listen for remote requests and forward them). But it looks like that decision has already been made. Mike
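To give a sense of how small such a proxy could be, here's a toy sketch using the JDK's built-in HTTP server in front of the 0.7-era Thrift client. The Keyspace1/Users names are placeholders, there's no connection pooling, and a real proxy would obviously need error handling, writes, auth, and so on.

import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.util.List;

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class HttpThriftProxy {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        final Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Keyspace1");

        HttpServer http = HttpServer.create(new InetSocketAddress(8080), 0);
        http.createContext("/", new HttpHandler() {
            public void handle(HttpExchange exchange) throws IOException {
                // GET /<row-key> -> newline-separated column names from the Users CF.
                String key = exchange.getRequestURI().getPath().substring(1);
                String body;
                try {
                    SlicePredicate all = new SlicePredicate();
                    all.setSlice_range(new SliceRange(
                            ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100));
                    List<ColumnOrSuperColumn> cols;
                    synchronized (client) { // single shared connection; a real proxy would pool
                        cols = client.get_slice(ByteBuffer.wrap(key.getBytes("UTF-8")),
                                new ColumnParent("Users"), all, ConsistencyLevel.ONE);
                    }
                    StringBuilder sb = new StringBuilder();
                    for (ColumnOrSuperColumn c : cols)
                        sb.append(new String(c.getColumn().getName(), "UTF-8")).append("\n");
                    body = sb.toString();
                } catch (Exception e) {
                    body = "error: " + e + "\n";
                }
                byte[] bytes = body.getBytes("UTF-8");
                exchange.sendResponseHeaders(200, bytes.length);
                OutputStream out = exchange.getResponseBody();
                out.write(bytes);
                out.close();
            }
        });
        http.start();
        System.out.println("HTTP -> Thrift proxy listening on :8080");
    }
}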
Re: Are 6..8 seconds to read 23.000 small rows - as it should be?
Yes, I know. And I might end up doing this in the end. I do though have pretty hard upper limits of how many rows I will end up with for each key, but anyways it might be a good idea none the less. Thanks for the advice on that one. You set count to Integer.MAX. Did you try with say 3? IIRC that makes a difference (while it shouldn't) even when you have still less than 3. Er, really? Just off hand, I feel like I've looked through most of the code that would be relevant and I can't think of any reason that would be the case. If it is, that definitely seems like a bug, particularly since the general strategy for fetching all the things in this row is to set count to Integer.MAX_VALUE! Mike
Re: Is SuperColumn necessary?
On Tue, May 11, 2010 at 7:46 AM, David Boxenhorn da...@lookin2.com wrote: I would like an API with a variable number of arguments. Using Java varargs, something like value = keyspace.get(articles, cars, John Smith, 2010-05-01, comment-25); or valueArray = keyspace.get(articles, predicate1, predicate2, predicate3, predicate4); Hrm. I haven't dug that deeply into the joys of predicate logic, propositional DAGs, etc. but couldn't this also be represented as a nested tree of predicates / other primitives. So it would be something like: SubColumns = Transformation that takes a predicate, applies it to a Column, then gets it's SubColumns keyspace.get(articles, SubColumns(predicate1, SubColumns(predicate2, SubColumns(predicate3, predicate4; It's more like functional programming-ish, I suppose, but I think that model might apply more cleanly here. FP does tend to result in nice clean algorithms for manipulating large data sets. Mike The storage layout would be determined by the configuration, as below: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True ... On Tue, May 11, 2010 at 5:26 PM, Jonathan Shook jsh...@gmail.com wrote: This is one of the sticking points with the key concatenation argument. You can't simply access subpartitions of data along an aggregate name using a concatenated key unless you can efficiently address a range of the keys according to a property of a subset. I'm hoping this will bear out with more of this discussion. Another facet of this issue is performance with respect to storage layout. Presently columns within a row are inherently organized for efficient range operations. The key space is not generally optimal in this way. I'm hoping to see some discussion of this, as well. On Tue, May 11, 2010 at 6:17 AM, vd vineetdan...@gmail.com wrote: Hi Can we make range search on ID:ID format as this would be treated as single ID by API or can it bifurcate on ':' . If now then how do can we ignore usage of supercolumns where we need to associate 'n' number of rows to a single ID. Like CatID1- articleID1 CatID1- articleID2 CatID1- articleID3 CatID1- articleID4 How can we map such scenarios with simple column families. Rgds. On Tue, May 11, 2010 at 2:11 PM, Torsten Curdt tcu...@vafer.org wrote: Exactly. On Tue, May 11, 2010 at 10:20, David Boxenhorn da...@lookin2.com wrote: Don't think of it as getting rid of supercolum. Think of it as adding superdupercolums, supertriplecolums, etc. Or, in sparse array terminology: array[dim1][dim2][dim3].[dimN] = value Or, as said above: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column
Re: How to write WHERE .. LIKE query ?
On Tue, May 11, 2010 at 8:54 AM, Schubert Zhang zson...@gmail.com wrote: In the future, maybe cassandra can provide some Filter or Coprocessor interfaces. Just like what of Bigtable do. But now, cassandra is too young, there are many things to do for a clear core. There's been talk of adding coprocessors. It will probably happen one day. Unfortunately, that day is probably a ways off. Mike On Tue, May 11, 2010 at 11:35 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 11:36 PM, vd vineetdan...@gmail.com wrote: Hi Mike AFAIK cassandra queries only on keys and not on column names, please verify. Incorrect. You can slice a row or rows (identified by a key) on a column name range (e.g., a through m) or ask for specific columns in a row or rows (e.g., please give me the first_name, last_name and hashed_password fields from my Users column family where the key equals mmalone). See the get_range_slices() method in the thrift service. Mike On Tue, May 11, 2010 at 11:06 AM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 9:00 PM, Shuge Lee shuge@gmail.com wrote: Hi all: How to write WHERE ... LIKE query ? For examples(described in Python): Schema: # columnfamily name resources = [ # key 'foo': { # columns and value 'url': 'foo.com', 'pushlier': 'foo', }, 'oof': { 'url': 'oof.com', 'pushlier': 'off', }, # ... , } # this is very easy, SELECT * FROM KEY = 'foo' but following are really hard: SELECT * FROM resources WHERE key LIKE 'o%' # get all records which key name contains character 'o'? get_range_slices(keyspace, ColumnParent(column_family), SlicePredicate(slice_range=SliceRange('',''), KeyRange('o', 'o~'), ConsistencyLevel.ONE); SELECT * FROM resources WHERE url == 'oof.com' This is a projection. Cassandra doesn't support this sort of query out of the box. You'll have to structure your data so that data you want to query by is in the key or column name. Or you'll have to manually build secondary indexes. Mike
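Spelling out that key-prefix trick in Java against the 0.7-style Thrift bindings: this only emulates LIKE 'o%' when the cluster uses an order-preserving partitioner, since under RandomPartitioner key ranges are token ranges and come back in token order. The 'o~' end key just relies on '~' sorting after the other printable characters; the Keyspace1/resources names are from the example above.

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class PrefixKeyScan {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Keyspace1");

        // Keys from 'o' up to 'o~', i.e. everything starting with 'o' (requires OPP).
        KeyRange range = new KeyRange(100);
        range.setStart_key(ByteBuffer.wrap("o".getBytes("UTF-8")));
        range.setEnd_key(ByteBuffer.wrap("o~".getBytes("UTF-8")));

        // All columns of each matching row (up to 1000 per row).
        SlicePredicate allColumns = new SlicePredicate();
        allColumns.setSlice_range(new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 1000));

        List<KeySlice> rows = client.get_range_slices(
                new ColumnParent("resources"), allColumns, range, ConsistencyLevel.ONE);
        for (KeySlice row : rows)
            System.out.println(new String(row.getKey(), "UTF-8") + ": " + row.getColumnsSize() + " columns");
        transport.close();
    }
}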
Re: Is SuperColumn necessary?
Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things are important, but that can happen later. Does anyone have any thoughts or comments on the architecture I suggested earlier? Mike On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote: Yes, the column here is not appropriate. Maybe we need not to create new terms, in Google's Bigtable, the term qualifier is a good one. On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote: That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and column with directory and subdirectory 3. A tuple tree: Column family replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc. 4. Etc. On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote: Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). Mock storage-conf.xml: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column Thrift: struct NamePredicate { 1: required listbinary column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.comwrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. 
http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names so that if you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: Is SuperColumn necessary?
On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote: I have to disagree about the naming of things. The name of something isn't just a literal identifier. It affects the way people think about it. For new users, the whole naming thing has been a persistent barrier. I'm saying we shouldn't be worried too much about coming up with names and analogies until we've decided what it is we're naming. As for your suggestions, I'm all for simplifying or generalizing the how it works part down to a more generalized set of operations. I'm not sure it's a good idea to require users to think in terms building up a fluffy query structure just to thread it through a needle of an API, even for the simplest of queries. At some point, the level of generic boilerplate takes away from the semantic hand rails that developers like. So I guess I'm suggesting that how it works and how we use it are not always exactly the same. At least they should both hinge on a common conceptual model, which is where the naming becomes an important anchoring point. If things are done properly, client libraries could expose simplified query interfaces without much effort. Most ORMs these days work by building a propositional directed acyclic graph that's serialized to SQL. This would work the same way, but it wouldn't be converted into a 4GL. Mike Jonathan On Mon, May 10, 2010 at 11:37 AM, Mike Malone m...@simplegeo.com wrote: Maybe... but honestly, it doesn't affect the architecture or interface at all. I'm more interested in thinking about how the system should work than what things are called. Naming things are important, but that can happen later. Does anyone have any thoughts or comments on the architecture I suggested earlier? Mike On Mon, May 10, 2010 at 8:36 AM, Schubert Zhang zson...@gmail.com wrote: Yes, the column here is not appropriate. Maybe we need not to create new terms, in Google's Bigtable, the term qualifier is a good one. On Thu, May 6, 2010 at 3:04 PM, David Boxenhorn da...@lookin2.com wrote: That would be a good time to get rid of the confusing column term, which incorrectly suggests a two-dimensional tabular structure. Suggestions: 1. A hypercube (or hypocube, if only two dimensions): replace key and column with 1st dimension, 2nd dimension, etc. 2. A file system: replace key and column with directory and subdirectory 3. A tuple tree: Column family replaced by top-level tuple, whose value is the set of keys, whose value is the set of supercolumns of the key, whose value is the set of columns for the supercolumn, etc. 4. Etc. On Thu, May 6, 2010 at 2:28 AM, Mike Malone m...@simplegeo.com wrote: Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). 
Mock storage-conf.xml: Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8 Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8 Column Name=ThingThatsNowSuperColumnName Type=Long Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII Column Name=ThingThatCantCurrentlyBeRepresented/ /Column /Column /Column /Column Thrift: struct NamePredicate { 1: required listbinary column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... query(predicate, count, consistency_level) # Count here would be total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last weeks discussion, I've been playing around with a simple column comparator for composite column names that I put up
Re: Is SuperColumn necessary?
On Mon, May 10, 2010 at 4:31 PM, AJ Chen ajc...@web2express.org wrote: supercolumn is good for modeling profile type of data. simple example is blog: blog { blog {author, title, ...} comments {time: commenter} //sort by TimeUUID } when retrieving a blog, you get all the comments sorted by time already. without supercolumn, you would need to concatenate multiple comment times together as you suggested. requiring the user to concatenate data fields together is not only an extra burden on the user but also a less clean design. there will be cases where the list property of a profile is a long list (say a million items). in such cases, the user wants to be able to directly insert/delete an item in that list because it's more efficient. Retrieving the whole list, updating it, concatenating again, and then putting it back to the datastore is awkward and less efficient. There's nothing you said here that can't be implemented efficiently using columns. You can slice rows and get a subset of Columns. In fact, this example is particularly easy to implement. If you have a Blog with Entries and Comments you'd do: ColumnFamily Name=Blog CompareWith=UTF8Type / Insert blog post: batch_mutate(key=blog post id, [{name=~post:author, value=author}, {name=~post:title, value=title, ...)) Insert comment: batch_mutate(key=blog post id, [{name=TimeUUID + :author, ... }] Then you can get the Post only (slice for [~, ]), the comments only (slice for [, ~]), or the post _and_ comments (slice for [, ]). Inserting a comment does _not_ require a get/concatenate/insert. Yes, concatenating the names on the client side is hacky, clunky, and inconvenient. That's why we _should_ build an interface that doesn't require the client to concatenate names. But SuperColumns aren't the right way to do it. They add no value. They could be implemented in client libraries, for example, and nobody would know the difference. To really understand the problem with SuperColumns, though, you need to look at the Cassandra source. Removing SuperColumns would make the code-base much cleaner and tighter, and would probably reduce SLOC by 20%. I think a replacement that assumed nested Columns (or Entries, or Thingies) would be much cleaner. That's what Stu is working on. Mike On Mon, May 10, 2010 at 2:20 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 1:38 PM, AJ Chen ajc...@web2express.org wrote: Could someone confirm this discussion is not about abandoning supercolumn families? I have found modeling data with supercolumn families is actually an advantage of Cassandra compared to relational databases. Hope you are not going to drop this important concept. How it's implemented internally is a different matter. SuperColumns are useful as a convenience mechanism. That's pretty much it. There's _nothing_ (as far as I can tell) that you can do with SuperColumns that you can't do by manually concatenating key names with a separator on the client side and implementing a custom comparator on the server (as ugly as that is). This discussion is about getting rid of SuperColumns and adding a more generic mechanism that will actually be useful and interesting and will continue to be convenient for the types of use cases for which people use SuperColumns. If there's a particular use case that you feel you can only implement with SuperColumns, please share! I honestly can't think of any. 
Mike On Mon, May 10, 2010 at 10:08 AM, Jonathan Shook jsh...@gmail.comwrote: Agreed On Mon, May 10, 2010 at 12:01 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 9:52 AM, Jonathan Shook jsh...@gmail.com wrote: I have to disagree about the naming of things. The name of something isn't just a literal identifier. It affects the way people think about it. For new users, the whole naming thing has been a persistent barrier. I'm saying we shouldn't be worried too much about coming up with names and analogies until we've decided what it is we're naming. As for your suggestions, I'm all for simplifying or generalizing the how it works part down to a more generalized set of operations. I'm not sure it's a good idea to require users to think in terms building up a fluffy query structure just to thread it through a needle of an API, even for the simplest of queries. At some point, the level of generic boilerplate takes away from the semantic hand rails that developers like. So I guess I'm suggesting that how it works and how we use it are not always exactly the same. At least they should both hinge on a common conceptual model, which is where the naming becomes an important anchoring point. If things are done properly, client libraries could expose simplified query interfaces without much effort. Most ORMs these days work by building a propositional directed acyclic graph that's serialized to SQL. This would work the same way
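Reading the blog example above back out is just a couple of slices. For instance, "comments only" sketched in Java against the 0.7-style Thrift bindings (the Blog column family and blog post id come from that example; this leans on every comment name, TimeUUID + ":...", sorting before the "~post:..." columns under UTF8Type):

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class BlogCommentsSlice {
    static List<ColumnOrSuperColumn> commentsOnly(Cassandra.Client client, String blogPostId) throws Exception {
        // Slice from the empty name up to "~": comment columns (TimeUUID + ":...")
        // all sort below "~", post columns ("~post:...") all sort at or above it.
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(
                ByteBuffer.allocate(0),
                ByteBuffer.wrap("~".getBytes("UTF-8")),
                false, 1000));
        return client.get_slice(
                ByteBuffer.wrap(blogPostId.getBytes("UTF-8")),
                new ColumnParent("Blog"), predicate, ConsistencyLevel.ONE);
    }
}

The "post only" and "post and comments" cases just change the start and end names, as in the [~, ] / [, ] shorthand above.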
Re: Is SuperColumn necessary?
Mike just suggested to concate comment id with each of the comment field names so that the above data can be stored in normal column family. It looks fine except that I'm not sure the time sorting on comments still works or not. In the case of time you can just use lexicographically sortable strings that represent your timestamp (e.g., RFC 3339). You're right, I don't think TimeUUID does that. For more complicated things (e.g., TimeUUIDs or packed numerics that you don't want to zero pad) you'd have to implement a custom comparator. So the convenience mechanisms that would have to be implemented (and, in fact, Stu and Ed have pretty much already implemented) would take care of concatenating the column names and doing the chained comparisons for you. Mike On Mon, May 10, 2010 at 5:36 PM, William Ashley wash...@gmail.com wrote: I'm having a difficult time understanding your syntax. Could you provide an example with actual data? On May 10, 2010, at 5:25 PM, AJ Chen wrote: your suggestion works for fixed supercolumn name. the blog example now becomes: { blog-id {name, title, ...} blog-id-comments {time:commenter} } what about supercolumn names that are not fixed? for example, I want to store comment's details with the blog like this: { blog-id { blog { name, title, ...} comments {comment-id:commenter} comment-id {commenter, time, text, ...} } a comment-id is generated on-the-fly when the comment is made. how do you flatten the comment-id supercolumn to normal column? just for brain exercise, not meant to pick on you. thanks, -aj On Mon, May 10, 2010 at 4:39 PM, William Ashley wash...@gmail.comwrote: If you're storing your super column under a fixed name, you could just concatenate that name with the row key and use normal columns. Then you get your paging and sorting the way you want it. On May 10, 2010, at 4:31 PM, AJ Chen wrote: supercolumn is good for modeling profile type of data. simple example is blog: blog { blog {author, title, ...} comments {time: commenter} //sort by TimeUUID } when retrieving a blog, you get all the comments sorted by time already. without supercolumn, you would need to concatenate multiple comment times together as you suggested. requiring user to concatenating data fields together is not only an extra burden on user but also a less clean design. there will be cases where the list property of a profile data is a long list (say a million items). in such cases, user wants to be able to directly insert/delete an item in that list because it's more efficient. Retrieving the whole list, updating it, concatenating again, and then putting it back to datastore is awkward and less efficient. -aj On Mon, May 10, 2010 at 2:20 PM, Mike Malone m...@simplegeo.com wrote: On Mon, May 10, 2010 at 1:38 PM, AJ Chen ajc...@web2express.orgwrote: Could someone confirm this discussion is not about abandoning supercolumn family? I have found modeling data with supercolumn family is actually an advantage of cassadra compared to relational database. Hope you are going to drop this important concept. How it's implemented internally is a different matter. SuperColumns are useful as a convenience mechanism. That's pretty much it. There's _nothing_ (as far as I can tell) that you can do with SuperColumns that you can't do by manually concatenating key names with a separator on the client side and implementing a custom comparator on the server (as ugly as that is). 
This discussion is about getting rid of SuperColumns and adding a more generic mechanism that will actually be useful and interesting and will continue to be convenient for the types of use cases for which people use SuperColumns. If there's a particular use case that you feel you can only implement with SuperColumns, please share! I honestly can't think of any. Mike
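To make the flattening described above concrete, here is a minimal sketch of the client-side convention (the helper and the separator are illustrative assumptions, not code from this thread): comment columns get names that concatenate an RFC 3339 timestamp, the comment id, and the field name, so a plain byte-ordered slice of a normal column family comes back already sorted by time.

import datetime
import uuid

SEP = ':'

def comment_column(created_at, comment_id, field):
    # RFC 3339 / ISO 8601 timestamps sort lexicographically, so byte ordering
    # on the column name gives time ordering for free.
    ts = created_at.strftime('%Y-%m-%dT%H:%M:%SZ')
    return SEP.join([ts, comment_id, field])

comment_id = uuid.uuid1().hex
now = datetime.datetime.utcnow()
columns = {
    comment_column(now, comment_id, 'commenter'): 'aj',
    comment_column(now, comment_id, 'text'): 'nice post',
}
# These columns would be written to the blog's row in a normal column family
# (e.g. via batch_mutate) instead of a per-comment SuperColumn.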
Re: Is SuperColumn necessary?
On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote: I would rather be interested in a tree-type structure where supercolumns have supercolumns in them. you don't need to compare all the columns to find a set of columns, and it will also reduce the bytes transferred for the separator, or at least the string concatenation (or something like that) for read and write column name generation. it is more logically stored and structured this way, and we can also make caching work better by selectively caching the tree (user defined if you will). But nothing wrong in supporting both :) I'm 99% sure we're talking about the same thing and we don't need to support both. How names/values are separated is pretty irrelevant. It has to happen somewhere. I agree that it'd be nice if it happened on the server, but doing it in the client makes it easier to explore ideas. On Thu, May 6, 2010 at 5:27 PM, philip andrew philip14...@gmail.com wrote: Please create a new term/word if the existing terms are misleading; if it's not a file system then it's not good to call it a file system. While it's seriously bikesheddy, I guess you're right. Let's call them thingies for now, then. So you can have a top-level thingy and it can have an arbitrarily nested tree of sub-thingies. Each thingy has a thingy type [1]. You can also tell Cassandra if you want a particular level of thingy to be indexed. At one (or maybe more) levels you can tell Cassandra you want your thingies to be split onto separate nodes in your cluster. At one (or maybe more) levels you could also tell Cassandra that you want your thingies split into separate files [2]. The upshot is, the Cassandra data model would go from being "it's a nested dictionary, just kidding no it's not!" to being "it's a nested dictionary, for serious." Again, these are all just ideas... but I think this simplified data model would allow you to express pretty much any query in a graph of simple primitives like Predicates, Filters, Aggregations, Transformations, etc. The indexes would allow you to cheat when evaluating certain types of queries - if you get a SlicePredicate on an indexed thingy, you don't have to enumerate the entire set of sub-thingies, for example. So, you'd query your thingies by building out a predicate, transformations, filters, etc., serializing the graph of primitives, and sending it over the wire to Cassandra. Cassandra would rebuild the graph and run it over your dataset. So instead of: Cassandra.get_range_slices( keyspace=AwesomeApp, column_parent=ColumnParent(column_family=user), slice_predicate=SlicePredicate(column_names=['username', 'dob']), range=KeyRange(start_key='a', end_key='m'), consistency_level=ONE ) You'd do something like: Cassandra.query( SubThingyTransformer( NamePredicate(names=[AwesomeApp]), SubThingyTransformer( NamePredicate(names=[user]), SubThingyTransformer( SlicePredicate(start=a, end=m), NamePredicate(names=[username, dob]) ) ) ), consistency_level=ONE ) Which seems complicated, but it's basically just [(user['username'], user['dob']) for user in Cassandra['AwesomeApp']['user'].slice('a', 'm')] and could probably be expressed that way in a client library. I think batch_mutate is awesome the way it is and should be the only way to insert/update data. I'd rename it mutate. So our interface becomes: Cassandra.query(query, consistency_level) Cassandra.mutate(mutation, consistency_level) Ta-da. Anyways, I was trying to avoid writing all of this out in prose and to mock some of it up in code instead. I guess this works too.
Either way, I do think something like this would simplify the codebase, simplify the data model, simplify the interface, make the entire system more flexible, and be generally awesome. Mike [1] These can be subclasses of Thingy in Java... or maybe they'd implement IThingy. But either way they'd handle serialization and probably implement compareTo to define natural ordering. So you'd have classes like ASCIIThingy, UTF8Thingy, and LongThingy (ahem) - these would replace comparators. [2] I think there's another simplification here. Splitting into separate files is really very similar to splitting onto separate nodes. There might be a way around some of the row size limitations with this sort of concept. And we may be able to get better utilization of multiple disks by giving each disk (or data directory) a subset of the node's token range. Caveat: thought not fully baked.
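A rough sketch of how a client library might hide that query graph behind the list-comprehension style mentioned above (every name below is hypothetical, mirroring the email's pseudocode rather than any real Cassandra API):

class NamePredicate:
    def __init__(self, names):
        self.names = names

class SlicePredicate:
    def __init__(self, start, end):
        self.start, self.end = start, end

class SubThingyTransformer:
    def __init__(self, predicate, child):
        self.predicate, self.child = predicate, child

class ThingyRef:
    # Records a path of predicates; nothing executes until the graph is built.
    def __init__(self, steps=()):
        self.steps = tuple(steps)
    def __getitem__(self, name):
        return ThingyRef(self.steps + (NamePredicate([name]),))
    def slice(self, start, end):
        return ThingyRef(self.steps + (SlicePredicate(start, end),))
    def query_graph(self, leaf):
        # Fold the recorded steps into nested transformers, outermost first.
        graph = leaf
        for step in reversed(self.steps):
            graph = SubThingyTransformer(step, graph)
        return graph

# The email's example, built fluently and then serialized for the server:
graph = ThingyRef()['AwesomeApp']['user'].slice('a', 'm').query_graph(
    NamePredicate(['username', 'dob']))
# Cassandra.query(graph, consistency_level=ONE) would ship this over the wire.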
Re: pagination through slices with deleted keys
On Fri, May 7, 2010 at 5:29 AM, Joost Ouwerkerk jo...@openplaces.org wrote: +1. There is some disagreement on whether the API should return empty columns or skip rows when no data is found. In all of our use cases, we would prefer skipped rows. And based on how frequently new Cassandra users appear to be confused about the current behaviour, this might be a more common use case than the need for empty cols. Perhaps this could be added as an option on SlicePredicate? (e.g. skipEmpty=true). That's exactly how we implemented it: struct SlicePredicate { 1: optional list<binary> column_names, 2: optional SliceRange slice_range, 3: optional bool ignore_empty_rows=0, } Mike
Re: pagination through slices with deleted keys
Our solution at SimpleGeo has been to hack Cassandra to (optionally, at least) be sensible and drop Rows that don't have any Columns. The claim from the FAQ that "Cassandra would have to check if there are any other columns in the row" is inaccurate. The common case for us at least is that we're only interested in Rows that have Columns matching our predicate. So if there aren't any, we just don't return that row. No need to check if the entire row is deleted. Mike On Thu, May 6, 2010 at 9:17 AM, Ian Kallen spidaman.l...@gmail.com wrote: I read the DistributedDeletes and the range_ghosts FAQ entry on the wiki, which do a good job describing how difficult deletion is in an eventually consistent system. But practical application strategies for dealing with it aren't there (that I saw). I'm wondering how folks implement pagination in their applications; if you want to render N results in an application, is the only solution to over-fetch and filter out the tombstones? Or is there something simpler that I overlooked? I'd like to be able to count (even if the counts are approximate) and fetch rows with the deleted ones filtered out (without waiting for the GCGraceSeconds interval + compaction), but from what I see so far, the burden is on the app to deal with the tombstones. -Ian
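Until something like that exists server side, the client-side workaround Ian is describing usually looks like this minimal sketch (fetch_range stands in for whatever your client exposes around get_range_slices; nothing here is from the thread):

def fetch_live_rows(fetch_range, start_key, page_size, overfetch_factor=2):
    # Over-fetch because some returned rows may be range ghosts: deleted rows
    # that still appear, with no columns, until GCGraceSeconds + compaction.
    # fetch_range(start_key, count) should return (key, columns) pairs.
    batch = fetch_range(start_key, page_size * overfetch_factor)
    live = [(key, cols) for key, cols in batch if cols]
    return live[:page_size]
# If too many ghosts get filtered out, repeat the call starting just past the
# last key seen until a full page is assembled.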
Re: pagination through slices with deleted keys
On Thu, May 6, 2010 at 3:27 PM, Ian Kallen spidaman.l...@gmail.com wrote: Cool, is this a patch you've applied on the server side? Are you running 0.6.x? I'm wondering if this kind of thing can make it into future versions of Cassandra. Yea, server side. It's basically doing the same thing clients typically want to do (again, at least for our use cases) but doing it closer to the data. Our patch is kind of janky though. I can probably get some version of it pushed back upstream - or at least on github or something - if there's any interest. Mike
Re: Is SuperColumn necessary?
Nice, Ed, we're doing something very similar but less generic. Now replace all of the various methods for querying with a simple query interface that takes a Predicate, allow the user to specify (in storage-conf) which levels of the nested Columns should be indexed, and completely remove Comparators and have people subclass Column / implement IColumn and we'd really be on to something ;). Mock storage-conf.xml: <Column Name=ThingThatsNowKey Indexed=True ClusterPartitioned=True Type=UTF8> <Column Name=ThingThatsNowColumnFamily DiskPartitioned=True Type=UTF8> <Column Name=ThingThatsNowSuperColumnName Type=Long> <Column Name=ThingThatsNowColumnName Indexed=True Type=ASCII> <Column Name=ThingThatCantCurrentlyBeRepresented/> </Column> </Column> </Column> </Column> Thrift: struct NamePredicate { 1: required list<binary> column_names, } struct SlicePredicate { 1: required binary start, 2: required binary end, } struct CountPredicate { 1: required struct predicate, 2: required i32 count=100, } struct AndPredicate { 1: required Predicate left, 2: required Predicate right, } struct SubColumnsPredicate { 1: required Predicate columns, 2: required Predicate subcolumns, } ... OrPredicate, OtherUsefulPredicates ... query(predicate, count, consistency_level) # Count here would be the total count of leaf values returned, whereas CountPredicate specifies a column count for a particular sub-slice. Not fully baked... but I think this could really simplify stuff and make it more flexible. Downside is it may give people enough rope to hang themselves, but at least the predicate stuff is easily distributable. I'm thinking I'll play around with implementing some of this stuff myself if I have any free time in the near future. Mike On Wed, May 5, 2010 at 2:04 PM, Jonathan Ellis jbel...@gmail.com wrote: Very interesting, thanks! On Wed, May 5, 2010 at 1:31 PM, Ed Anuff e...@anuff.com wrote: Follow-up from last week's discussion: I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach. http://github.com/edanuff/CassandraCompositeType Ed On Wed, Apr 28, 2010 at 12:52 PM, Ed Anuff e...@anuff.com wrote: It might make sense to create a CompositeType subclass of AbstractType for the purpose of constructing and comparing these types of composite column names, so that you could more easily do that sort of thing rather than having to concatenate into one big string. On Wed, Apr 28, 2010 at 10:25 AM, Mike Malone m...@simplegeo.com wrote: The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
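To show how those pieces would compose, here is a toy rendering of the proposed predicates in Python (namedtuple stand-ins for the Thrift sketch above; none of these structs exist in Cassandra):

from collections import namedtuple

NamePredicate = namedtuple('NamePredicate', 'column_names')
SlicePredicate = namedtuple('SlicePredicate', 'start end')
SubColumnsPredicate = namedtuple('SubColumnsPredicate', 'columns subcolumns')
AndPredicate = namedtuple('AndPredicate', 'left right')

# "top-level columns between a and m, and within each, only username and dob"
predicate = SubColumnsPredicate(
    columns=SlicePredicate(start=b'a', end=b'm'),
    subcolumns=NamePredicate(column_names=[b'username', b'dob']))
# AndPredicate would intersect constraints at one level, e.g.
# AndPredicate(left=SlicePredicate(start=b'a', end=b'm'),
#              right=NamePredicate(column_names=[b'alice']))
# query(predicate, count=100, consistency_level=ONE): the count argument caps
# the total leaf values returned, while a CountPredicate caps one sub-slice.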
Re: Is SuperColumn necessary?
On Wed, Apr 28, 2010 at 5:24 AM, David Boxenhorn da...@lookin2.com wrote: If I understand correctly, the distinction between supercolumns and subcolumns is critical to good database design if you want to use random partitioning: you can do range queries on subcolumns but not on supercolumns. Is this correct? You can do efficient range queries of normal (not super) columns in a ColumnFamily. I think SuperColumns are not indexed, so it's less efficient to do a slice of subcolumns from a SuperColumn if there are lots of subcolumns. I agree that SuperColumns are technically unnecessary. There aren't any use cases I can come up with that a SuperColumn satisfies that normal Columns can't. You can simulate SuperColumn behavior by concatenating key parts with a separator and using the concatenated key as your column name, then doing a slice. So if you had a SuperColumn that stored usernames, and sub-columns that stored document IDs, you could instead have a normal CF that stores username:document-id. The only thing SuperColumns appear to buy you (as someone pointed out to me at the Cassandra meetup - I think it was Eric Florenzano) is that you can use different comparator types for the Super/SubColumns, I guess..? But you should be able to do the same thing by creating your own Column comparator. I guess my point is that SuperColumns are mostly a convenience mechanism, as far as I can tell. Mike
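A minimal sketch of the read side of that simulation (the separator, sentinel, and the commented slice call are assumptions, since the exact client API varies): with column names of the form username:document-id under a byte-ordered comparator, all of one user's document ids come back from a single prefix slice.

SEP = ':'

def prefix_bounds(username):
    # Everything for this user sorts between 'username:' and 'username:<0xff>'
    # (good enough for ASCII usernames; a real implementation would pick a
    # proper end-of-range sentinel).
    return username + SEP, username + SEP + '\xff'

start, finish = prefix_bounds('mmalone')
# get_slice(row_key, column_parent,
#           SlicePredicate(slice_range=SliceRange(start, finish)), ...)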
Re: At what point does the cluster get faster than the individual nodes?
On Wed, Apr 21, 2010 at 9:50 AM, Mark Greene green...@gmail.com wrote: Right, it's a similar concept to DB sharding, where you spread the write load around to different DB servers; it won't necessarily increase the throughput of any one DB server, but rather the collective throughput. Except with Cassandra, read-repair causes every read to go to every replica for a piece of data. Mike
Re: timestamp not found
Looks like the timestamp, in this case, is 0. Does Cassandra allow zero timestamps? Could be a bug in Cassandra doing an implicit boolean coercion in a conditional where it shouldn't. Mike On Thu, Apr 15, 2010 at 8:39 AM, Lee Parker l...@socialagency.com wrote: We are currently migrating about 70G of data from MySQL to Cassandra. I am occasionally getting the following error: Required field 'timestamp' was not found in serialized data! Struct: Column(name:74 65 78 74, value:44 61 73 20 6C 69 65 62 20 69 63 68 20 76 6F 6E 20 23 49 6E 61 3A 20 68 74 74 70 3A 2F 2F 77 77 77 2E 79 6F 75 74 75 62 65 2E 63 6F 6D 2F 77 61 74 63 68 3F 76 3D 70 75 38 4B 54 77 79 64 56 77 6B 26 66 65 61 74 75 72 65 3D 72 65 6C 61 74 65 64 20 40 70 6A 80 01 00 01 00, timestamp:0) The loop which is building out the mutation map for the batch_mutate call is adding a timestamp to each column. I have verified that the timestamp is there for several calls, and I feel like if the logic were bad, I would see the error more frequently. Does anyone have suggestions as to what may be causing this? Lee Parker l...@spredfast.com
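The kind of coercion bug being speculated about would look roughly like this (purely hypothetical Python, not code from Cassandra or the Thrift bindings):

def write_field(name, value):
    print(name, value)  # stand-in for the real serializer

# Buggy: a zero timestamp is falsy, so it gets treated as "not set" and is
# never serialized, which yields "Required field 'timestamp' was not found".
def serialize_column(column):
    if column.get('timestamp'):
        write_field('timestamp', column['timestamp'])

# Fixed: test for presence, not truthiness.
def serialize_column_fixed(column):
    if column.get('timestamp') is not None:
        write_field('timestamp', column['timestamp'])

serialize_column({'timestamp': 0})        # silently drops the field
serialize_column_fixed({'timestamp': 0})  # writes timestamp 0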
Re: Reading thousands of columns
On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis jbel...@gmail.com wrote: 35-50ms for how many rows of 1000 columns each? get_range_slices does not use the row cache, for the same reason that Oracle doesn't cache tuples from sequential scans -- blowing away 1000s of rows worth of recently used rows queried by key, for a swath of rows from the scan, is the wrong call more often than it is the right one. Couldn't you cache a list of keys that were returned for the key range, then cache individual rows separately or not at all? By "blowing away" rows queried by key I'm guessing you mean pushing them out of the LRU cache, not explicitly blowing them away? Either way I'm not entirely convinced. In my experience I've had pretty good success caching items that were pulled out via more complicated join / range type queries. If your system is doing lots of range queries, and not a lot of lookups by key, you'd obviously see a performance win from caching the range queries. Maybe range scan caching could be turned on separately? Mike
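The two-level scheme being suggested might look like this sketch (cache shapes and the fetch callbacks are placeholders, not Cassandra internals):

range_key_cache = {}  # (start_key, end_key) -> list of row keys in the range
row_cache = {}        # row key -> row contents

def cached_range_slice(start_key, end_key, fetch_range, fetch_rows):
    keys = range_key_cache.get((start_key, end_key))
    if keys is None:
        rows = fetch_range(start_key, end_key)            # full scan, hits disk
        range_key_cache[(start_key, end_key)] = [k for k, _ in rows]
        row_cache.update(dict(rows))
        return rows
    missing = [k for k in keys if k not in row_cache]
    if missing:
        row_cache.update(fetch_rows(missing))             # point reads for evicted rows
    return [(k, row_cache[k]) for k in keys]
# Caching the key list separately means a scan never evicts hot rows wholesale;
# individual rows can still age out of the row cache on their own.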
Re: How do vector clocks and conflicts work?
On Tue, Apr 6, 2010 at 11:03 AM, Tatu Saloranta tsalora...@gmail.com wrote: On Tue, Apr 6, 2010 at 8:45 AM, Mike Malone m...@simplegeo.com wrote: As long as the conflict resolver knows that two writers each tried to increment, then it can increment twice. The conflict resolver must know about the semantics of increment or decrement or string append or binary patch or whatever other merge strategy you choose. You'll register your strategy with Cassandra and it will apply it. Presumably it will also maintain enough context about what you were trying to accomplish to allow the merge strategy plugin to do it properly. That is to say, my understanding was that vector clocks would be required but not sufficient for reconciliation of concurrent value updates. The way I envisioned eventually consistent counters working would require something slightly more sophisticated... but not too bad. As incr/decr operations happen on distributed nodes, each node would keep a (vector clock, delta) tuple for that node's local changes. When a client fetched the value of the counter, the vector clock deltas and the reconciled count would be combined into a single result. Similarly, when a replication / hinted-handoff / read-repair reconciliation occurred, the counts would be merged into a single (vector clock, count) tuple. Maybe there's a more elegant solution, but that's how I had been thinking about this particular problem. I doubt there is any simple and elegant solution -- if there was, it would have been invented in the 50s. :-) Given this, yes, something along these lines sounds realistic. It also sounds like the implementation would greatly benefit from (if not require) foundational support from core, as opposed to being done outside of Cassandra (which I understand you are suggesting). I wasn't sure if the idea was to try to do this completely separately (aside from vector clock support). I'd probably put it in core. Or at least put some more generic support for this sort of conflict resolution in core. I'm looking forward to seeing Digg's patch for this stuff. Mike
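Reduced to its merge logic, the (vector clock, delta) design described above might look like this sketch (replica ids and clock comparison are heavily simplified; this is not a proposal for actual Cassandra internals):

def record_increment(local_state, replica_id, clock, amount=1):
    # Each replica only ever touches its own entry, accumulating a local delta.
    _, delta = local_state.get(replica_id, (None, 0))
    local_state[replica_id] = (clock, delta + amount)

def merge(states):
    # Reconciliation keeps the newest entry per replica and never throws away
    # another replica's delta -- unlike "latest timestamp wins" on a plain value.
    merged = {}
    for state in states:
        for replica_id, (clock, delta) in state.items():
            current = merged.get(replica_id)
            if current is None or clock > current[0]:
                merged[replica_id] = (clock, delta)
    return merged

def value(merged):
    return sum(delta for _, delta in merged.values())

a = {}
b = {}
record_increment(a, 'node-a', clock=1)
record_increment(a, 'node-a', clock=2)
record_increment(b, 'node-b', clock=1)
print(value(merge([a, b])))  # 3: both replicas' increments survive reconciliation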
Re: Memcached protocol?
On Mon, Apr 5, 2010 at 1:46 PM, Paul Prescod p...@ayogo.com wrote: On Mon, Apr 5, 2010 at 1:35 PM, Mike Malone m...@simplegeo.com wrote: That's useful information, Mike. I am a bit curious about what the most common use cases are for atomic increment/decrement. I'm familiar with atomic add as a sort of locking mechanism. They're useful for caching denormalized counts of things. Especially things that change rapidly. Instead of invalidating the counter whenever an event occurs that would incr/decr the counter, you can incr/decr the cached count too. Do you think that a future Cassandra increment/decrement would be incompatible with those use cases? It seems to me that in that use case, an eventually consistent counter is as useful as any other eventually consistent datum. An eventually consistent count operation in Cassandra would be great, and it would satisfy all of the use cases I would typically use counts for in memcached. It's just a matter of reconciling inconsistencies with a more sophisticated operation than "latest write wins" (specifically, the reconciliation operation should apply all incr/decr ops). Mike
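For the cached denormalized counts mentioned here, the incr/decr pattern (as opposed to invalidate-and-recount) looks roughly like this sketch (the cache client and the recount helper are generic placeholders; assume incr returns None when the key is missing):

def bump_comment_count(cache, post_id, recount):
    key = 'comment_count:%s' % post_id
    # Instead of deleting the cached count and forcing an expensive recount,
    # adjust it in place so readers keep hitting a warm value.
    if cache.incr(key) is None:
        cache.set(key, recount(post_id))
# An eventually consistent counter in Cassandra would let the same adjustment
# survive replica divergence, as long as reconciliation sums the incr/decr ops
# rather than picking the latest write.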
Re: Ring management and load balance
On Thu, Mar 25, 2010 at 9:56 AM, Jonathan Ellis jbel...@gmail.com wrote: The advantage to doing it the way Cassandra does is that you can keep keys sorted with OrderPreservingPartitioner for range scans. Grabbing one token of many from each node in the ring would prohibit that. So we rely on active load balancing to get to a good enough balance, say within 50%. It doesn't need to be perfect. This makes sense for the order-preserving partitioner. But for the random partitioner, multiple tokens per node would certainly make balancing easier... I haven't dug into that bit of the Cassandra implementation yet. Would it be very difficult to support both modes of operation? For what it's worth, we've already seen annoying behavior when adding nodes to the cluster. It's obviously true that the absolute size of partitions becomes smaller as the cluster grows, but if your relatively balanced 100-node cluster is at, say, 70% capacity and you add 10 more nodes, you would presumably want this additional capacity to be evenly distributed. And right now that's pretty much impossible to do without rebalancing the entire cluster. Mike
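A back-of-the-envelope way to see why multiple tokens per node would help when growing the cluster (toy hash ring on [0, 1) with random tokens; nothing here reflects Cassandra's actual token handling):

import random
from collections import defaultdict

def ownership(node_count, tokens_per_node):
    tokens = sorted((random.random(), node)
                    for node in range(node_count)
                    for _ in range(tokens_per_node))
    owned = defaultdict(float)
    for i, (tok, node) in enumerate(tokens):
        prev = tokens[i - 1][0] if i else tokens[-1][0] - 1.0  # wrap around
        owned[node] += tok - prev
    return owned

# More tokens per node tightens the spread of per-node ownership (ratio of the
# busiest node to the ideal share), which is what lets newly added nodes take
# small slices from everywhere instead of splitting only a few ranges.
for tokens_per_node in (1, 256):
    load = ownership(110, tokens_per_node)
    print(tokens_per_node, max(load.values()) / (1.0 / 110))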