Re: Cluster per Application vs. Multi-Application Clusters
If you are starting out small, one logical/physical cluster is probably the best and only approach. Long term this is very case-by-case dependent, but I generally believe Cluster per Application is the best approach, although I think of it as Cluster per QOS. For our use cases I find that two applications can have very different data sizes and quality-of-service requirements. For example, one application may have a small dataset and a high repeated-read / cache-hit-rate scenario, while another application may have a large sparse dataset and a random read pattern. Also, one application may demand fast 3 ms reads while the other may find 10 or 20 ms reads acceptable. When those two applications are placed on the same set of hardware you end up scaling them both even though at a given time only one or the other needs to be scaled. In extreme cases application 1 and 2 cause contention and make each other unhappy. What is best to do is architect your systems in such a way that moving an individual column family to a new set of hardware is not difficult. This might involve something like a map/reduce program that can bulk load existing data between two clusters, while your front-end application sends the writes/updates/deletes to both the old and the new cluster. Also make sure your application does not have too many hard-coded touch points that assume a single cluster. As you mentioned, one thing gained from keeping everything in the same keyspace is connection pooling. However, unlike the RDBMS world where coordinated transactions have to happen in order, etc., that is not the case with C*, so getting all data into the same physical system is not as important. On Wed, Aug 22, 2012 at 8:25 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Just an opinion here as we are having to do this ourselves, loading tons of researchers' datasets into one cluster.
We are going the path of one keyspace as it makes it easier if you ever want to mine the data, so you don't have to keep building different clients for another keyspace. We ended up adding our own security layer as well so researchers can expose their datasets to other researchers and, once exposed, other researchers can join that data with their existing data. This of course is just one use case, but if 10 applications use Cassandra, you may still find a benefit in having an 11th data-mining app look at the data from all 10 apps. Later, Dean playOrm Developer From: Ersin Er ersin...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, August 22, 2012 12:44 AM To: user@cassandra.apache.org Subject: Cluster per Application vs. Multi-Application Clusters Hi all, What are the advantages of allocating a cluster for a single application vs running multiple applications on the same Cassandra cluster? Is either of the models suggested over the other? Thanks. -- Ersin Er
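The migration approach described above (bulk-loading historical data while the front end writes to both the old and the new cluster) can be sketched roughly as follows. This is a minimal sketch; `DualWriteClient` and the cluster client objects are hypothetical stand-ins for whatever driver is actually in use:

```python
class DualWriteClient:
    """Sketch of the dual-write migration pattern: every mutation goes to
    both the old and the new cluster, while reads stay on the old cluster
    until the bulk load of historical data has caught up."""

    def __init__(self, old_cluster, new_cluster):
        self.old = old_cluster
        self.new = new_cluster

    def write(self, key, columns):
        # Apply the mutation to both clusters so the new one stays in
        # sync while a map/reduce job copies existing data across.
        self.old.write(key, columns)
        self.new.write(key, columns)

    def read(self, key):
        # Reads are served from the old cluster until cut-over.
        return self.old.read(key)
```

Once the bulk load finishes and the clusters agree, reads flip to the new cluster and the old one can be retired.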
Re: Automating nodetool repair
You can consider adding -pr when iterating through all your hosts like this. -pr means primary range, and will do less duplicated work. On Mon, Aug 27, 2012 at 8:05 PM, Aaron Turner synfina...@gmail.com wrote: I use cron. On one box I just do:

for n in node1 node2 node3 node4 ; do
    nodetool -h $n repair
    sleep 120
done

A lot easier than managing a bunch of individual crontabs IMHO, although I suppose I could have done it with puppet, but then you always have to keep an eye out that your repairs don't overlap over time. On Mon, Aug 27, 2012 at 4:52 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: Hi all, So nodetool repair has to be run regularly on all nodes. Does anybody have any interesting strategies or tools for doing this or is everybody just setting up cron to do it? For example, one could write some Puppet code to splay the cron times around so that only one should be running at once. Or, perhaps, a central orchestrator that is given some known quiet time and works its way through the list, running nodetool repair one at a time (using RPC?) until it runs out of time. Cheers, Edward -- Edward Sargisson senior java developer Global Relay edward.sargis...@globalrelay.net
-- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero
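The rolling-repair loop above, with the -pr suggestion applied, can also be expressed as a small script that builds the per-node nodetool invocations. This is a sketch; the node names are placeholders:

```python
def repair_commands(nodes, primary_range_only=True):
    """Build the nodetool invocation for each node in a rolling repair.
    With -pr each node repairs only its primary range, so iterating over
    every host in the ring does not redo work for every replica."""
    flag = ["-pr"] if primary_range_only else []
    return [["nodetool", "-h", node, "repair"] + flag for node in nodes]

# A cron job would then run each command in turn, sleeping between
# nodes (e.g. subprocess.call(cmd) followed by time.sleep(120)) so
# repairs on different hosts do not overlap.
```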
Re: Advantage of pre-defining column metadata
Setting the metadata will set the validation. If you insert into a column that is supposed to hold only INT values, Cassandra will reject non-INT data at insert time. Also, the comparator can not be changed; you only get one chance to set the column sorting. On Tue, Aug 28, 2012 at 3:34 PM, A J s5a...@gmail.com wrote: For a static column family, what is the advantage in pre-defining column metadata? I can see ease of understanding the type of values that the CF contains and that clients will have incompatible insertions rejected. But are there any major advantages in terms of performance or something else that makes it beneficial to define the metadata upfront? Thanks.
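As a toy model (not Cassandra's actual code path), the effect of pre-defined column metadata is roughly this: the validator attached to a column rejects ill-typed values at insert time. The validator names and the in-memory layout below are illustrative only:

```python
# Toy validators keyed by names resembling Cassandra's marshal types.
VALIDATORS = {
    "Int32Type": lambda v: isinstance(v, int) and -2**31 <= v < 2**31,
    "UTF8Type": lambda v: isinstance(v, str),
}

class ColumnFamily:
    def __init__(self, metadata):
        # metadata: column name -> validator name, fixed at CF creation.
        self.metadata = metadata
        self.rows = {}

    def insert(self, key, column, value):
        # Columns with declared metadata are validated on insert;
        # columns without metadata are accepted as-is.
        validator = self.metadata.get(column)
        if validator is not None and not VALIDATORS[validator](value):
            raise ValueError("invalid value for %s (%s)" % (column, validator))
        self.rows.setdefault(key, {})[column] = value
```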
Re: performance is drastically degraded after 0.7.8 -- 1.0.11 upgrade
If you move from 0.7.X to 0.8.X or 1.0.X you have to rebuild sstables as soon as possible. If you have large bloom filters you can hit a bug where the bloom filters will not work properly. On Thu, Aug 30, 2012 at 9:44 AM, Илья Шипицин chipits...@gmail.com wrote: we are running a somewhat queue-like workload with aggressive write-read patterns. I was looking for a way of scripting queries from a live Cassandra installation, but I didn't find any. Is there something like a thrift-proxy or other query logging/scripting engine? 2012/8/30 aaron morton aa...@thelastpickle.com: "in terms of our high-rate write load cassandra-1.0.11 is about 3 (three!!) times slower than cassandra-0.7.8" We've not had any reports of a performance drop-off. All tests so far have shown improvements in both read and write performance. "I agree, such digests save some network IO, but they seem to be very bad in terms of CPU and disk IO." The sha1 is created so we can diagnose corruptions in the -Data component of the SSTables. It is not used to save network IO. It is calculated while streaming the Memtable to disk, so it has no impact on disk IO. While not the fastest algorithm, I would assume its CPU overhead in this case is minimal. "there's already a relatively small Bloom filter file, which can be used for saving network traffic instead of the sha1 digest." Bloom filters are used to test if a row key may exist in an SSTable. "any explanation ?" If you can provide some more information on your use case we may be able to help. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 30/08/2012, at 5:18 AM, Илья Шипицин chipits...@gmail.com wrote: in terms of our high-rate write load cassandra-1.0.11 is about 3 (three!!) times slower than cassandra-0.7.8. After some investigation I noticed files with a sha1 extension (which are missing for cassandra-0.7.8). In the maybeWriteDigest() function I see no option for switching sha1 digests off.
I agree, such digests save some network IO, but they seem to be very bad in terms of CPU and disk IO. Why use one more digest (which has to be calculated)? There's already a relatively small Bloom filter file, which could be used for saving network traffic instead of the sha1 digest. Any explanation? Ilya Shipitsin
Re: Helenos - web based gui tool
You might want to change the name. There is a node.js driver for Cassandra with the same name. I am not sure which one of you got to the name first. On Thu, Sep 6, 2012 at 8:00 PM, aaron morton aa...@thelastpickle.com wrote: Thanks Tomek, Feel free to add it to http://wiki.apache.org/cassandra/Administration%20Tools Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 5/09/2012, at 9:54 AM, Tomek Kuprowski tomekkuprow...@gmail.com wrote: Dear all, I'm happy to announce the first release of Helenos. This is a web based GUI tool to manage your data stored in Cassandra. Project site: https://github.com/tomekkup/helenos Some screens: https://picasaweb.google.com/tomekkuprowski/Helenos Hope you'll find it useful. I'll be grateful for your comments and opinions. -- Regards! Tomek Kuprowski
Re: cassandra performance looking great...
Try to get Cassandra running the TPC-H benchmarks and beat Oracle :) On Fri, Sep 7, 2012 at 10:01 AM, Hiller, Dean dean.hil...@nrel.gov wrote: So we wrote 1,000,000 rows into Cassandra and ran a simple S-SQL (Scalable SQL) query of PARTITIONS n(:partition) SELECT n FROM TABLE as n WHERE n.numShares = :low and n.pricePerShare = :price It ran in 60ms. So basically playOrm is going to support millions of rows per partition. This is great news. We expect the join performance to be very similar since the trees of pricePerShare and numShares are really no different than the join trees. So, millions of rows per partition and as many partitions as you want, it scales wonderfully…..CASSANDRA ROCKS. Behind the scenes, there is a wide row per partition per index, so the above query behind the scenes has two rows, each with 1,000,000 columns. Later, Dean
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Generally tuning the garbage collector is a waste of time. Just follow someone else's recommendation and use that. The problem with tuning is that workloads change, and then you have to tune again and again. New garbage collectors come out and you have to tune again and again. Someone at your company reads a blog about some new JVM and its awesomeness and you tune again and again; Cassandra adds off-heap caching and you tune again and again. All this work takes a lot of time and usually results in negligible returns. Garbage collectors and tuning are not magic bullets. On Wednesday, September 12, 2012, Peter Schuller peter.schul...@infidyne.com wrote: "Our full gc:s are typically not very frequent. Few days or even weeks in between, depending on cluster." *PER NODE* that is. On a cluster of hundreds of nodes, that's pretty often (and all it takes is a single node). -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: JVM 7, Cass 1.1.1 and G1 garbage collector
Haha Ok. It is not a total waste, but practically your time is better spent in other places. The problem is just about everything is a moving target: schema, request rate, hardware. Generally tuning nudges a couple of variables in one direction or the other and you see some decent returns. But each nudge takes a restart and a warm-up period, and with how Cassandra distributes requests you likely have to flip several nodes or all of them before you can see the change! By the time you do that it's probably a different day or week. Essentially finding out if one setting is better than the other is like a 3-day test in production. Before C* I used to deal with this in Tomcat. Once in a while we would get a dev that read some article about tuning, something about a new JVM, or a collector. With bright-eyed enthusiasm they would want to try tuning our current cluster. They would spend a couple of days and measure something and say it was good: lower memory usage. Meanwhile someone else would come to me and say: higher 95th-percentile response time. More short pauses, fewer long pauses, great taste, less filling. Most people just want to roflscale their Heroku cloud. Tuning stuff is sysadmin work, and the cloud has taught us that the cost of sysadmins is a needless waste of money. Just kidding! But I do believe the default Cassandra settings are reasonable, and typically I find that most who look at tuning GC usually need more hardware and actually need to be tuning something somewhere else. G1 is the perfect example of a time suck. It claims low pause latency for big heaps, and delivers something regarded by the Cassandra community (and HBase as well) as working worse than CMS. If you spent 3 hours switching tuning knobs and analysing, that is 3 hours of your life you will never get back.
Better to let Sun and other people worry about tuning (at least from where I sit). On Saturday, September 15, 2012, Peter Schuller peter.schul...@infidyne.com wrote: "Generally tuning the garbage collector is a waste of time." Sorry, that's BS. It can be absolutely critical, when done right, and only useless when done wrong. There's a spectrum in between. "Just follow someone else's recommendation and use that." No, don't. Most recommendations out there are completely useless in the general case because someone did some very specific benchmark under very specific circumstances and then recommends some particular combination of options. In order to understand whether a particular recommendation applies to you, you need to know enough about your use-case that I suspect you're better off just reading up on the available options and figuring things out. Of course, randomly trying various different settings to see which seems to work well may be realistic - but you lose predictability (in the face of changing patterns of traffic, for example) if you don't know why it's behaving like it is. If you care about GC-related behavior you want to understand how the application behaves, how the garbage collector behaves, what your requirements are, and select settings based on those requirements and how the application and GC behavior combine to produce emergent behavior. The best GC options may vary *wildly* depending on the nature of your cluster and your goals. There are also non-GC settings (in the specific case of Cassandra) that affect the interaction with the garbage collector, like whether you're using row/key caching, or things like phi conviction threshold and/or timeouts. It's very hard for anyone to give generalized recommendations. If it weren't, Cassandra would ship with The One True set of settings that are always the best and there would be no discussion.
It's very unfortunate that the state of GC in the freely available JVMs is at this point, given that there exist known and working algorithms (and at least one practical implementation) that avoid it, mostly. But it's the situation we're in. The only way around it that I know of, if you're on Hotspot, is to have the application behave in such a way that it avoids the causes of unpredictable behavior w.r.t. GC by being careful about its memory allocation and *retention* profile. For the specific case of avoiding *ever* seeing a full gc, it gets even more complex. -- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: any ways to have compaction use less disk space?
If you are using ext3 there is a hard limit of 32K on the number of files in a directory. ext4 has a much higher limit (can't remember exactly). So it is true that having many files is not a problem for the file system, though your VFS cache could be less efficient since you would have a higher inode-to-data ratio. Edward On Mon, Sep 24, 2012 at 7:03 PM, Aaron Turner synfina...@gmail.com wrote: On Mon, Sep 24, 2012 at 10:02 AM, Віталій Тимчишин tiv...@gmail.com wrote: Why so? What are the pluses and minuses? As for me, I am looking at the number of files in a directory. 700GB/512MB*5 (files per SST) = ~7,000 files, that is OK from my view. 700GB/5MB*5 = ~700,000 files, that is too much for a single directory, too much memory used for SST data, too huge a compaction queue (that leads to strange pauses, I suppose because of the compactor thinking about what to compact next), ... Not sure why a lot of files is a problem... modern filesystems deal with that pretty well. Really large sstables mean that compactions now are taking a lot more disk IO and time to complete. Remember, Leveled Compaction is more disk-IO intensive, so using large sstables makes that even worse. This is a big reason why the default is 5MB. Also, each level is 10x the size of the previous level. Also, for level compaction, you need 10x the sstable size worth of free space to do compactions. So now you need 5GB of free disk, vs 50MB of free disk. Also, if you're doing deletes in those CFs, that old, deleted data is going to stick around a LOT longer with 512MB files, because it can't get deleted until you have 10x512MB files to compact to level 2. Heaven forbid it doesn't get deleted then, because each level is 10x bigger, so you end up waiting a LOT longer to actually delete that data from disk. Now, if you're using SSDs then larger sstables are probably doable, but even then I'd guesstimate 50MB is far more reasonable than 512MB.
-Aaron 2012/9/23 Aaron Turner synfina...@gmail.com On Sun, Sep 23, 2012 at 8:18 PM, Віталій Тимчишин tiv...@gmail.com wrote: If you think about space, use Leveled compaction! This won't only allow you to fill more space, but will also shrink your data much faster in case of updates. Size-tiered compaction can give you 3x-4x more space used than there is live data. Consider the following (our simplified) scenario: 1) The data is updated weekly. 2) Each week a large SSTable is written (say, 300GB) after full update processing. 3) In 3 weeks you will have 1.2TB of data in large SSTables. 4) Only after the 4th week will they all be compacted into one 300GB SSTable. Leveled compaction has tamed space for us. Note that you should set sstable_size_in_mb to a reasonably high value (it is 512 for us with ~700GB per node) to prevent creating a lot of small files. 512MB per sstable? Wow, that's freaking huge. From my conversations with various developers 5-10MB seems far more reasonable. I guess it really depends on your usage patterns, but that seems excessive to me - especially as sstables are promoted. -- Best regards, Vitalii Tymchyshyn -- Aaron Turner http://synfin.net/ Twitter: @synfinatic
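The space math in this thread (10x level fan-out, roughly 10x the sstable size kept free for a promotion) can be sketched as:

```python
def level_capacity_bytes(sstable_size_mb, level, fanout=10):
    """Approximate capacity of a level under Leveled Compaction:
    L1 holds ~fanout sstables and each subsequent level is fanout times
    bigger, so level N holds about sstable_size * fanout**N."""
    return sstable_size_mb * 1024 * 1024 * fanout ** level

def free_space_needed_mb(sstable_size_mb, fanout=10):
    """A promotion into the next level can rewrite up to ~fanout sstables
    at once, so keep roughly fanout times the sstable size free."""
    return sstable_size_mb * fanout
```

With the 5MB default this is ~50MB of required headroom; with 512MB sstables it grows to ~5GB, matching the numbers in the thread.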
Re: 1000's of column families
Hector also offers support for 'Virtual Keyspaces' which you might want to look at. On Thu, Sep 27, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote: On Thu, Sep 27, 2012 at 3:11 PM, Hiller, Dean dean.hil...@nrel.gov wrote: We have 1000's of different building devices and we stream data from these devices. The format and data from each one varies, so one device has temperature at timeX with some other variables, another device has CO2 percentage and other variables. Every device is unique and streams its own data. We dynamically discover devices and register them. Basically, one CF or table per thing really makes sense in this environment. While we could try to find out which devices are similar, this would really be a pain, and some devices add some new variable into the equation. NOT only that, but researchers can register new datasets and upload them as well, and each dataset they have they do NOT necessarily want to share with other researchers, so we have security groups and each CF belongs to security groups. We dynamically create CFs on the fly as people register new datasets. On top of that, when the data sets get too large, we probably want to partition a single CF into time partitions. We could create one CF and put all the data in with a partition per device, but then a time partition would contain multiple devices' data, meaning we would need to shrink our time partition size, whereas if we have a CF per device, the time partition can be larger as it is only for that one device. THEN, on top of that, we have a meta CF for these devices, so some people want to query for streams that match criteria AND which returns a CF name, and then they query that CF name, so we almost need a query with variables like select cfName from Meta where x = y and then select * from cfName where x. Which we can do today. How strict are your security requirements? If it wasn't for that, you'd be much better off storing data on a per-statistic basis than per-device.
Hell, you could store everything in a single CF by using a composite row key: devicename|stat type|instance But yeah, there isn't a hard limit on the number of CFs, but there is overhead associated with each one, and so I wouldn't consider your design scalable. Generally speaking, hundreds are OK, but thousands is pushing it. -- Aaron Turner http://synfin.net/ Twitter: @synfinatic
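The single-CF alternative above hinges on packing several identifiers into one row key. A minimal sketch of a delimiter-based composite key, assuming the parts never contain the delimiter (Cassandra's real CompositeType uses length-prefixed encoding instead, precisely to avoid that assumption):

```python
SEP = "|"

def composite_row_key(device, stat_type, instance):
    # Pack the three identifiers into a single row key, e.g.
    # "device42|temperature|sensor0".
    return SEP.join((device, stat_type, instance))

def split_row_key(key):
    # Recover the original identifiers from a packed key.
    return tuple(key.split(SEP))
```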
Re: Ball is rolling on High Performance Cassandra Cookbook second edition
Hello all, Work has begun on the second edition! Keep hitting me up with ideas. In particular I am looking for someone who has done work with Flume+Cassandra and Pig+Cassandra. Both of these topics will be covered to some extent in the second edition, but these are two instances in which I could use some help, as I do not have extensive experience with these two combinations. Contact me if you have any other ideas as well. Edward On Tue, Jun 26, 2012 at 5:25 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, It has not been very long since the first book was published, but several things have been added to Cassandra and a few things have changed. I am putting together a list of changed content, for example features like the old per-column-family memtable flush settings versus the new system with the global variable. My editors have given me the green light to grow the second edition from ~200 pages currently up to 300 pages! This gives us the ability to add more items/sections to the text. Some things were missing from the first edition, such as Hector support. Nate has offered to help me in this area. Please feel free to contact me with any ideas and suggestions of recipes you would like to see in the book. Also get in touch if you want to write a recipe. Several people added content to the first edition and it would be great to see that type of participation again. Thank you, Edward
Re: MBean cassandra.db.CompactionManager TotalBytesCompacted counts backwards
I have not looked at this JMX object in a while; however, the compaction manager can support multiple threads. Also it moves from 0 to filesize each time it has to compact a set of files. That is more useful for showing current progress rather than lifetime history. On Fri, Oct 5, 2012 at 7:27 PM, Bryan Talbot btal...@aeriagames.com wrote: I've recently added compaction rate (in bytes / second) to my monitors for Cassandra and am seeing some odd values. I wasn't expecting the values for TotalBytesCompacted to sometimes decrease from one reading to the next. It seems that the value should be monotonically increasing while a server is running -- obviously it would start again at 0 when the server is restarted or if the counter rolls over (unlikely for a 64 bit long). Below are two samples taken 60 seconds apart: the value decreased by 2,954,369,012 between the two readings. reported_metric=[timestamp:1349476449, status:200, request:[mbean:org.apache.cassandra.db:type=CompactionManager, attribute:TotalBytesCompacted, type:read], value:7548675470069] previous_metric=[timestamp:1349476389, status:200, request:[mbean:org.apache.cassandra.db:type=CompactionManager, attribute:TotalBytesCompacted, type:read], value:7551629839081] I briefly looked at the code for CompactionManager and a few related classes and don't see anyplace that is performing subtraction explicitly; however, there are many additions of signed long values that are not validated and could conceivably contain a negative value, thus causing the totalBytesCompacted to decrease. It's interesting to note that all of the differences I've seen so far are more than the overflow value of a signed 32 bit value. The OS (CentOS 5.7) and Sun Java VM (1.6.0_29) are both 64 bit. JNA is enabled. Is this expected and normal? If so, what is the correct interpretation of this metric? I'm seeing the negative values a few times per hour when reading it once every 60 seconds. -Bryan
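A monitor that derives a rate from TotalBytesCompacted can guard against the backwards jumps observed above by treating a negative delta as unknown rather than reporting a negative rate. A sketch:

```python
def bytes_compacted_delta(previous, current):
    """TotalBytesCompacted can move backwards (multiple compaction
    threads, or a restarted server resetting to 0), so treat a negative
    delta as 'unknown' instead of producing a negative rate."""
    delta = current - previous
    return delta if delta >= 0 else None

def compaction_rate(previous, current, interval_seconds):
    # Bytes/second over the sampling interval, or None when the counter
    # went backwards and the sample should be discarded.
    delta = bytes_compacted_delta(previous, current)
    return None if delta is None else delta / interval_seconds
```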
Re: how to avoid range ghosts?
Read this: http://wiki.apache.org/cassandra/FAQ#range_ghosts Then say this to yourself: http://cn1.kaboodle.com/img/b/0/0/196/4/C1xHoQAAAZZL9w/ghostbusters-logo-i-aint-afraid-of-no-ghost-pinback-button-1.25-pin-badge.jpg?v=1320511953000 On Sun, Oct 7, 2012 at 4:15 AM, Satoshi Yamada bigtvioletb...@yahoo.co.jp wrote: Hi, What is the recommended way to avoid range ghosts when using get_range()? In my case, the order of the keys is not a problem. It seems valid to use a random :start_key in every query, but I'm new to Cassandra and do not know if it's recommended or not. I use Cassandra 1.1.4 and the Ruby client. Range ghosts happen when one process keeps on inserting data while another process does get_range and deletes them. thanks in advance, satoshi
Re: can I have a mix of 32 and 64 bit machines in a cluster?
Java abstracts you from all these problems. One thing to look out for is JVM options do not work across all JVMs. For example if you try to enable https://wikis.oracle.com/display/HotSpotInternals/CompressedOops on a 32bit machine the JVM fails to start. On Tue, Oct 9, 2012 at 1:45 PM, Brian Tarbox tar...@cabotresearch.com wrote: I can't imagine why this would be a problem but I wonder if anyone has experience with running a mix of 32 and 64 bit nodes in a cluster. (I'm not going to do this in production, just trying to make use of the gear I have for my local system). Thanks.
Re: unexpected behaviour on seed nodes when using -Dcassandra.replace_token
Yes. That would be a good jira if it is not already listed. If a node is a seed node, the auto_bootstrap and replace_token settings should trigger a fatal non-start because you're giving C* conflicting directions. Edward On Fri, Oct 19, 2012 at 8:49 AM, Thomas van Neerijnen t...@bossastudios.com wrote: Hi all I recently tried to replace a dead node using -Dcassandra.replace_token=token, which so far has been good to me. However on one of my nodes this option was ignored and the node simply picked a different token to live at and started up there. It was a foolish mistake on my part because it was set as a seed node, which results in this error in the log file: INFO [main] 2012-10-19 12:41:00,886 StorageService.java (line 518) This node will not auto bootstrap because it is configured to be a seed node. but it seems a little scary that this would mean it'll just ignore the fact that you want to replace a token and put itself somewhere else in the cluster. Surely it should behave similarly to trying to replace a live node by throwing some kind of exception?
Java 7 support?
We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one undo the upgrade and restart. On Tuesday, October 23, 2012, Eric Evans eev...@acunu.com wrote: On Tue, Oct 16, 2012 at 7:54 PM, Rob Coli rc...@palominodb.com wrote: On Tue, Oct 16, 2012 at 4:45 PM, Edward Sargisson edward.sargis...@globalrelay.net wrote: The Datastax documentation says that Java 7 is not recommended[1]. However, Java 6 is due to EOL in Feb 2013 so what is the reasoning behind that comment? I've asked this approximate question here a few times, with no official response. The reason I ask is that in addition to Java 7 not being recommended, in Java 7 OpenJDK becomes the reference JVM, and OpenJDK is also not recommended. From other channels, I have conjectured that the current advice on Java 7 is it 'works' but is not as extensively tested (and definitely not as commonly deployed) as Java 6. That sounds about right. The best way to change the status quo would be to use Java 7, report any bugs you find, and share your experiences. -- Eric Evans Acunu | http://www.acunu.com | @acunu
Re: Keeping the record straight for Cassandra Benchmarks...
Yet another benchmark with 100,000,000 rows on EC2 machines probably less powerful than my laptop. The benchmark might as well have run 4 VMware instances on the same desktop. On Thu, Oct 25, 2012 at 7:40 AM, Brian O'Neill b...@alumni.brown.edu wrote: People probably saw... http://www.networkworld.com/cgi-bin/mailto/x.cgi?pagetosend=/news/tech/2012/102212-nosql-263595.html To clarify things take a look at... http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Java 7 support?
I am using the Sun JDK. There are only two issues I have found, both unrelated to Cassandra. 1) DateFormat is more liberal about patterns like yyyyMMDD vs yyyyMMdd. If you write an application with Java 7 the format is forgiving with DD vs dd, yet if you deploy that application to some JDK 1.6 JVMs it fails. 2) Ran into some issues with TimSort(): http://stackoverflow.com/questions/6626437/why-does-my-compare-method-throw-exception-comparison-method-violates-its-gen Again, neither of these manifested in Cassandra, but they did manifest with other applications. On Wed, Oct 24, 2012 at 9:14 PM, Andrey V. Panov panov.a...@gmail.com wrote: Are you using OpenJDK or Oracle JDK? I know Java 7 should be based on OpenJDK since 7, but still not sure. On 25 October 2012 05:42, Edward Capriolo edlinuxg...@gmail.com wrote: We have been using cassandra and java7 for months. No problems. A key concept of java is portable binaries. There are sometimes wrinkles with upgrades. If you hit one, undo the upgrade and restart.
Large results and network round trips
Hello all, Currently we implement wide rows for most of our entities. For example: user { event1=x event2=y event3=z ... } Normally the entries are bounded to fewer than 256 columns and most columns are small in size, say 30 bytes. Because of the blind-write nature of Cassandra it is possible the column family can get much larger. We have very low latency requirements, for example less than 5ms. Considering the network round trip and all other factors, I am wondering what is the largest response that is possible in a 5ms window on a gigabit network. First we have our Thrift limit of 15MB; is it possible even in the best-case scenario to deliver a 15MB response in under 5ms on gigabit Ethernet, for example? Does anyone have any real-world numbers with reference to payload sizes and standard performance? Thanks all, Edward
Re: Large results and network round trips
For this scenario, remove disk speed from the equation. Assume the row is completely in the row cache. Also let's assume Read.ONE. With this information I would be looking to determine response size / maximum requests per second / max latency. I would use this to say: You want to do 5,000 reads/sec on gigabit Ethernet, each row is 10K, in under 5ms latency? Sorry, that is impossible. On Thu, Oct 25, 2012 at 2:58 PM, sankalp kohli kohlisank...@gmail.com wrote: I don't have any sample data on this, but read latency will depend on these: 1) Consistency level of the read 2) Disk speed. Also you can look at the Netflix client, as it makes the coordinator node the same as the node which holds the data. This will reduce one hop. On Thu, Oct 25, 2012 at 9:04 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Hello all, Currently we implement wide rows for most of our entities. We have very low latency requirements, for example less than 5ms, and considering the network round trip I am wondering what is the largest response possible in a 5ms window on a gigabit network. Thanks all, Edward
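A quick back-of-the-envelope check of the 15MB question: serialization time on the wire alone, ignoring disk, CPU, Thrift framing and TCP round trips, already rules out a 5ms budget:

```python
def wire_transfer_ms(payload_bytes, link_bits_per_second=1_000_000_000):
    """Lower bound on delivery time: how long the payload alone takes to
    serialize onto the link, with no protocol or processing overhead."""
    return payload_bytes * 8 / link_bits_per_second * 1000

# A 15 MB response needs at least ~126 ms on the wire at 1 Gbit/s, so it
# can never fit a 5 ms window. A typical row in the question (256 columns
# of ~30 bytes, ~7.5 KB) serializes in well under 0.1 ms, leaving the
# budget to disk, CPU, and round trips instead.
```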
Re: disable compaction node-wide
If you are using size-tiered compaction, set minCompactionThreshold to 0 and maxCompactionThreshold to 0. You can probably also use this: https://issues.apache.org/jira/browse/CASSANDRA-2130 But if you do not compact, the number of sstables gets high and then read performance can suffer.

On Sat, Oct 27, 2012 at 4:21 PM, Radim Kolar h...@filez.com wrote: Is it possible to disable all sstable compaction node-wide? I can't find anything suitable in the JMX console.
Getting all schema in 1.2.0-beta-1
Using 1.2.0-beta1. I am noticing that there is no longer a single way to get all the schema. It seems like non-compact storage can be seen with show schema, but other tables are not visible. Is this by design, bug, or operator error? http://pastebin.com/PdSDsdTz
How does Cassandra optimize this query?
If we create a column family:

CREATE TABLE videos (
  videoid uuid,
  videoname varchar,
  username varchar,
  description varchar,
  tags varchar,
  upload_date timestamp,
  PRIMARY KEY (videoid, videoname)
);

The CLI views this column family like so:

create column family videos
  with column_type = 'Standard'
  and comparator = 'CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UUIDType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

[default@videos] list videos;
Using default limit of 100
Using default column limit of 100
-------------------
RowKey: b3a76c6b-7c7f-4af6-964f-803a9283c401
=> (column=Now my dog plays piano!:description, value=My dog learned to play the piano because of the cat., timestamp=135205828907)
=> (column=Now my dog plays piano!:tags, value=dogs,piano,lol, timestamp=1352058289070001)
invalid UTF8 bytes 0139794c30c0

SELECT * FROM videos WHERE videoname = 'My funny cat';

 videoid                              | videoname    | description                               | tags           | upload_date          | username
--------------------------------------+--------------+-------------------------------------------+----------------+----------------------+----------
 99051fe9-6a9c-46c2-b949-38ef78858dd0 | My funny cat | My cat likes to play the piano! So funny. | cats,piano,lol | 2012-06-01 08:00:00+ | ctodd

CQL3 allows me to search on the second component of a primary key, which really just seems to be component 1 of a composite column. So what Thrift operation does this correspond to? This looks like a column slice without specifying a key? How does this work internally?
Re: How does Cassandra optimize this query?
I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have less than a few thousand rows in Cassandra.

On Mon, Nov 5, 2012 at 12:24 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Nov 5, 2012 at 4:12 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Is this query the equivalent of a full table scan? Without a starting point get_range_slice is just starting at token 0? It is, but that's what you asked for after all. If you want to start at a given token you can do: SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) > 'whatevertokenyouwant' You can even do: SELECT * FROM videos WHERE videoname = 'My funny cat' AND token(videoid) > token(99051fe9-6a9c-46c2-b949-38ef78858dd0) if that's simpler for you than computing the token manually. Though that is mostly for random partitioners. For ordered ones, you can do without the token() part. -- Sylvain
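The token()-based paging Sylvain describes can be sketched outside Cassandra. Assuming RandomPartitioner (the 1.1-era default), where a key's token is derived from its MD5 hash, paging walks the ring in token order rather than key order:

```python
import hashlib

def token(key: bytes) -> int:
    # RandomPartitioner derives the token from the MD5 hash of the key
    # (a sketch of the idea, not the exact production token math).
    return int.from_bytes(hashlib.md5(key).digest(), "big")

keys = [b"alpha", b"bravo", b"charlie", b"delta"]
ring = sorted(keys, key=token)  # keys laid out in token order, like the ring

# Page through two at a time; the next page starts after the last seen token,
# exactly what "AND token(videoid) > token(...)" expresses in CQL.
page1 = ring[:2]
page2 = [k for k in ring if token(k) > token(page1[-1])]
assert page1 + page2 == ring
```

The point is that "next page" is defined by token position, not by a numeric offset, so it stays correct even as rows are inserted elsewhere in the ring.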
Re: triggers(newbie)
There are no built-in triggers. Someone has written an aspect-oriented piece to do triggers outside of the project: http://brianoneill.blogspot.com/2012/03/cassandra-triggers-for-indexing-and.html

On Mon, Nov 5, 2012 at 12:30 PM, davuk...@veleri.hr wrote: Hello! I was wondering if someone could help me a bit with triggers in Cassandra. I am doing a school project with this DBMS, and I would be very happy if you could send me a simple example/explanation of a trigger. Thank you!! :)
Re: How does Cassandra optimize this query?
"A remark like 'maybe we just shouldn't allow that and leave that to the map-reduce side' would make sense, but I don't see how this is misleading."

Yes. Bingo. It is misleading because it is not useful in any other context besides someone playing around with a ten-row table in cqlsh. CQL stops me from executing some queries that are not efficient, yet it allows this one. If I am new to Cassandra and developing, this query works and produces a result, then once my database gets real data it produces a different result (likely an empty one). When I first saw this query two things came to my mind: 1) CQL (and Cassandra) must somehow be indexing all the fields of a primary key to make this search optimal. 2) This is impossible; CQL must be gathering the first hundred random rows and finding this thing. What is happening is case #2. In a nutshell CQL is just sampling some data and running the query on it. We could support all types of query constructs if we just take the first 100 rows and apply this logic to them, but these things are not helpful for anything but light ad-hoc data exploration. My suggestions: 1) force people to supply a LIMIT clause on any query that is going to page over get_range_slice 2) have some type of EXPLAIN support so I can establish if this query will work at scale. I say this because as an end user I do not understand if a given query is actually going to return the same results with different data.

On Mon, Nov 5, 2012 at 1:40 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Nov 5, 2012 at 6:55 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I see. It is fairly misleading because it is a query that does not work at scale. This syntax is only helpful if you have less than a few thousand rows in Cassandra. Just for the sake of argument, how is that misleading? If you have billions of rows and do the select statement from your initial mail, what did the syntax lead you to believe it would return?
A remark like "maybe we just shouldn't allow that and leave that to the map-reduce side" would make sense, but I don't see how this is misleading. But again, this translates directly to a get_range_slice (which doesn't scale if you have billions of rows and don't limit the output either), so there is nothing new here.
Re: Multiple keyspaces vs Multiple CFs
It is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools.

On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 keyspaces with 10 CFs in each keyspace, or 100 keyspaces with 1 CF each? I am talking in terms of memory footprint. Also I would be interested to know how much better one is over the other. Thanks, Sankalp
Re: Multiple keyspaces vs Multiple CFs
Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about? On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
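The 1-in-10 odds described above can be illustrated with a toy simulation (an illustration only, not any real client's pooling code): each pooled connection is pinned to whatever keyspace it last used, so most checkouts need an extra set_keyspace round trip.

```python
import random

random.seed(0)
keyspaces = [f"ks{i}" for i in range(10)]

# A pool of 1000 connections, each "stuck" on the keyspace it last served.
pool = [random.choice(keyspaces) for _ in range(1000)]

want = "ks0"
switches = sum(1 for ks in pool if ks != want)
switch_rate = switches / len(pool)

# With 10 keyspaces, roughly 9 out of 10 checkouts would need a
# set_keyspace round trip before the real request can be sent.
assert switch_rate > 0.8
```

A keyspace-aware pool avoids this by bucketing connections per keyspace, at the cost of holding more open sockets, which is the trade-off discussed later in this thread.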
Re: Multiple keyspaces vs Multiple CFs
In the old days the API looked like this:

client.insert("Keyspace1", "key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);

but now it works like this:

/* pay attention to this below */
client.set_keyspace("keyspace1");
/* pay attention to this above */
client.insert("key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE);

So each time you switch keyspaces you make a network round trip.

On Thu, Nov 8, 2012 at 6:17 PM, sankalp kohli kohlisank...@gmail.com wrote: I am a bit confused. One connection pool I know is the one which MessageService has to other nodes. Then there will be incoming connections via thrift from clients. How are they affected by multiple keyspaces? On Thu, Nov 8, 2012 at 3:14 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about? On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
Re: Multiple keyspaces vs Multiple CFs
It is not as bad with Hector, but still each Keyspace object is another socket open to Cassandra. If you have 500 webservers and 10 keyspaces, instead of having 500 connections you now have 5,000.

On Thu, Nov 8, 2012 at 6:35 PM, sankalp kohli kohlisank...@gmail.com wrote: I think this code is from the thrift part. I use hector. In hector, I can create multiple keyspace objects for each keyspace and use them when I want to talk to that keyspace. Why will it need to do a round trip to the server for each switch. On Thu, Nov 8, 2012 at 3:28 PM, Edward Capriolo edlinuxg...@gmail.com wrote: In the old days the API looked like this: client.insert("Keyspace1", "key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE); but now it works like this: /* pay attention to this below */ client.set_keyspace("keyspace1"); /* pay attention to this above */ client.insert("key_user_id", new ColumnPath("Standard1", null, "name".getBytes("UTF-8")), "Chris Goffinet".getBytes("UTF-8"), timestamp, ConsistencyLevel.ONE); So each time you switch keyspaces you make a network round trip. On Thu, Nov 8, 2012 at 6:17 PM, sankalp kohli kohlisank...@gmail.com wrote: I am a bit confused. One connection pool I know is the one which MessageService has to other nodes. Then there will be incoming connections via thrift from clients. How are they affected by multiple keyspaces? On Thu, Nov 8, 2012 at 3:14 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Any connection pool. Imagine if you have 10 column families in 10 keyspaces. You pull a connection off the pool and the odds are 1 in 10 of it being connected to the keyspace you want. So 9 out of 10 times you have to have a network round trip just to change the keyspace, or you have to build a keyspace aware connection pool. Edward On Thu, Nov 8, 2012 at 5:36 PM, sankalp kohli kohlisank...@gmail.com wrote: Which connection pool are you talking about?
On Thu, Nov 8, 2012 at 2:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote: it is better to have one keyspace unless you need to replicate the keyspaces differently. The main reason for this is that changing keyspaces requires an RPC operation. Having 10 keyspaces would mean having 10 connection pools. On Thu, Nov 8, 2012 at 4:59 PM, sankalp kohli kohlisank...@gmail.com wrote: Is it better to have 10 Keyspaces with 10 CF in each keyspace. or 100 keyspaces with 1 CF each. I am talking in terms of memory footprint. Also I would be interested to know how much better one is over other. Thanks, Sankalp
Re: Retrieve Multiple CFs from Range Slice
HBase is different in this regard. A table is comprised of multiple column families, and they can be scanned at once. However, last time I checked, scanning a table with two column families still means separate seeks across the different column families' store files. A similar thing can be accomplished in Cassandra by issuing two range scans (possibly executing them asynchronously in two threads). I am sure someone will correct me if I am mistaken.

On Fri, Nov 9, 2012 at 11:46 PM, Chris Larsen clar...@euphoriaaudio.com wrote: Hi! Is there a way to retrieve the columns for all column families on a given row while fetching range slices? My keyspace has two column families and when I'm scanning over the rows, I'd like to be able to fetch the columns in both CFs while iterating over the keys so as to avoid having to run two scan operations. When I set the CF to an empty string, ala ColumnParent.setColumn_family(), it throws an error "non-empty columnfamily is required". (Using the Thrift API directly from JAVA on Cass 1.1.6) My HBase scans can return both CFs per row so it works nicely. Thanks!
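The "two range scans in two threads" suggestion can be sketched like this. The `scan_cf` function here is a hypothetical stand-in for a per-column-family range scan (in practice that would be a Thrift get_range_slices call against each CF); the point is only the concurrency pattern:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a per-CF range scan; a real client would issue
# get_range_slices against each column family here.
FAKE_STORE = {"cf1": ["row1", "row2"], "cf2": ["row1", "row3"]}

def scan_cf(cf_name):
    return FAKE_STORE[cf_name]

# Issue both scans concurrently so the slower one does not serialize the faster.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {cf: pool.submit(scan_cf, cf) for cf in ("cf1", "cf2")}
    results = {cf: f.result() for cf, f in futures.items()}

assert results == {"cf1": ["row1", "row2"], "cf2": ["row1", "row3"]}
```

The caller can then merge the two result streams by row key while iterating, which approximates HBase's multi-CF scan from the client side.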
Re: leveled compaction and tombstoned data
No, it does not exist. Rob and I might start a donation page and give the money to whoever is willing to code it. If someone would write a tool that would split an sstable into 4 smaller sstables (even an offline command line tool) I would paypal them a hundo. On Sat, Nov 10, 2012 at 1:10 PM, Aaron Turner synfina...@gmail.com wrote: Nope. I think at least once a week I hear someone suggest one way to solve their problem is to write an sstablesplit tool. I'm pretty sure that: Step 1. Write sstablesplit Step 2. ??? Step 3. Profit! On Sat, Nov 10, 2012 at 9:40 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: @Rob Coli Does the sstablesplit function exist somewhere? 2012/11/10 Jim Cistaro jcist...@netflix.com For some of our clusters, we have taken the periodic major compaction route. There are a few things to consider: 1) Once you start major compacting, depending on data size, you may be committed to doing it periodically because you create one big file that will take forever to naturally compact against 3 like-sized files. 2) If you rely heavily on file cache (rather than large row caches), each major compaction effectively invalidates the entire file cache because everything is written to one new large file. -- Jim Cistaro On 11/9/12 11:27 AM, Rob Coli rc...@palominodb.com wrote: On Thu, Nov 8, 2012 at 10:12 AM, B. Todd Burruss bto...@gmail.com wrote: my question is would leveled compaction help to get rid of the tombstoned data faster than size tiered, and therefore reduce the disk space usage? You could also... 1) run a major compaction 2) code up sstablesplit 3) profit! This method incurs a management penalty if not automated, but is otherwise the most efficient way to deal with tombstones and obsolete data.
:D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. -- Benjamin Franklin carpe diem quam minimum credula postero
Re: [BETA RELEASE] Apache Cassandra 1.2.0-beta2 released
Just a note for all: the default partitioner is no longer RandomPartitioner. It is now Murmur3Partitioner, and the token range starts in negative numbers. So you don't choose tokens like your father taught you anymore.

On Friday, November 9, 2012, Sylvain Lebresne sylv...@datastax.com wrote: The Cassandra team is pleased to announce the release of the second beta for the future Apache Cassandra 1.2.0. Let me first stress that this is beta software and as such is *not* ready for production use. This release is still beta so is likely not bug free. However, lots have been fixed since beta1 and if everything goes right, we are hopeful that a first release candidate may follow shortly. Please do help testing this beta to help make that happen. If you encounter any problem during your testing, please report[3,4] them. And be sure to take a look at the change log[1] and the release notes[2] to see where Cassandra 1.2 differs from the previous series. Apache Cassandra 1.2.0-beta2[5] is available as usual from the cassandra website (http://cassandra.apache.org/download/) and a debian package is available using the 12x branch (see http://wiki.apache.org/cassandra/DebianPackaging). Thank you for your help in testing and have fun with it. [1]: http://goo.gl/wnDAV (CHANGES.txt) [2]: http://goo.gl/CBsqs (NEWS.txt) [3]: https://issues.apache.org/jira/browse/CASSANDRA [4]: user@cassandra.apache.org [5]: http://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=shortlog;h=refs/tags/cassandra-1.2.0-beta2
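For context on "the token range starts in negative numbers": Murmur3Partitioner tokens span the signed 64-bit range, -2^63 to 2^63-1, so evenly spaced initial tokens are computed differently than the old 0..2^127 RandomPartitioner math. A sketch of the even-spacing calculation for a small cluster:

```python
# Evenly spaced initial tokens over Murmur3Partitioner's signed 64-bit range,
# e.g. for a hypothetical 4-node cluster.
num_nodes = 4
RANGE_MIN, RANGE_SIZE = -(2**63), 2**64

tokens = [RANGE_MIN + i * (RANGE_SIZE // num_nodes) for i in range(num_nodes)]

assert tokens[0] == -9223372036854775808       # the first token is negative
assert all(-(2**63) <= t < 2**63 for t in tokens)
assert len(tokens) == num_nodes
```

The same formula with `0` and `2**127` reproduces the old RandomPartitioner token assignment, which is the "like your father taught you" method this note is retiring.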
Re: CREATE COLUMNFAMILY
If you supply metadata, Cassandra can use it for several things:
1) It validates data on insertion.
2) It helps display the information in human-readable formats in tools like the CLI and sstable2json.
3) If you add a built-in secondary index, the type information is needed; strings sort differently than integers.
4) Columns in rows are sorted by the column name; strings sort differently than integers.

On Sat, Nov 10, 2012 at 11:55 PM, Kevin Burton rkevinbur...@charter.net wrote: I am sure this has been asked before, but what is the purpose of entering key/value (or more correctly key name/data type) values on the CREATE COLUMNFAMILY command?
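The "strings sort differently than integers" point is easy to demonstrate, and it is exactly why the comparator type matters for column ordering:

```python
# The same values ordered as strings vs. as integers give different results:
# lexicographically "10" < "2" < "9", numerically 2 < 9 < 10.
as_strings = sorted(["9", "10", "2"])
as_ints = sorted([9, 10, 2])

assert as_strings == ["10", "2", "9"]
assert as_ints == [2, 9, 10]
```

If a column family's comparator is UTF8Type, slice queries over numeric-looking column names return them in the first (lexicographic) order, which is rarely what you want for numbers.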
Re: removing SSTABLEs
If you shut down C* and remove an sstable (and its associated data, index, bloom filter, etc. files) it is safe. I would delete any saved caches as well. It is safe in the sense that Cassandra will start up with no issues, but you could be missing some data.

On Sun, Nov 11, 2012 at 11:09 PM, B. Todd Burruss bto...@gmail.com wrote: if i stop a node and remove an SSTABLE, let's call it X, is that safe? ok, more info. i know that the data in SSTABLE X has been tombstoned but the tomstones are in SSTABLE Y. i want to simply delete X and get rid of the data. how do i know this .. i did a major compaction a while back and the SSTABLE is so large it has not yet been compacted. we delete data daily and only keep 7 days of data. the SSTABLE is almost 30 days old. whattayathink?
Re: removing SSTABLEs
Because you did a major compaction, that table is larger than all the rest. So it will never go away until you have 3 other tables of about that size or you run a major compaction again. You should vote on the ticket: https://issues.apache.org/jira/browse/CASSANDRA-4766

On Mon, Nov 12, 2012 at 11:51 AM, Jason Wee peich...@gmail.com wrote: The existence of sstable X will give an impact to the system or cluster? when the compaction threshold is reach, the sstable x and sstable y will be compacted. it's more like the system responsibility than human intervention. On Mon, Nov 12, 2012 at 12:09 PM, B. Todd Burruss bto...@gmail.com wrote: if i stop a node and remove an SSTABLE, let's call it X, is that safe? ok, more info. i know that the data in SSTABLE X has been tombstoned but the tomstones are in SSTABLE Y. i want to simply delete X and get rid of the data. how do i know this .. i did a major compaction a while back and the SSTABLE is so large it has not yet been compacted. we delete data daily and only keep 7 days of data. the SSTABLE is almost 30 days old. whattayathink?
Re: unable to read saved rowcache from disk
Yes, the saved row cache could be incorrect, so on startup Cassandra verifies it by re-reading the rows. It takes a long time, so do not save a big row cache.

On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provided by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since the data aren't huge, why Cassandra can't read it back? My Cassandra is 1.2.0-beta2.
Re: Read during digest mismatch
I think the code base does not benefit from having too many different read code paths. Logically what you're suggesting is reasonable, but you have to consider the case of one replica being slow to respond. Then what?

On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: If consistency is two, don't we just send data request to one and digest request to another? On Mon, Nov 12, 2012 at 2:49 AM, Jonathan Ellis jbel...@gmail.com wrote: Correct. Which is one reason there is a separate setting for cross-datacenter read repair, by the way. On Thu, Nov 8, 2012 at 4:43 PM, sankalp kohli kohlisank...@gmail.com wrote: Hi, Lets say I am reading with consistency TWO and my replication is 3. The read is eligible for global read repair. It will send a request to get data from one node and a digest request to two. If there is a digest mismatch, what I am reading from the code looks like it will get the data from all three nodes and do a resolve of the data before returning to the client. Is it correct or am I reading the code wrong? Also if this is correct, looks like if the third node is in another DC, the read will slow down even when the consistency was TWO? Thanks, Sankalp -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
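The digest-mismatch flow being discussed can be sketched in a few lines. This is an illustration of the logic (not Cassandra's actual read path): one replica returns data, another returns only a digest, and on mismatch all replicas are read and resolved by timestamp:

```python
import hashlib

# (value, timestamp) held by each of three replicas; one has a newer write.
replicas = [("old", 10), ("new", 20), ("old", 10)]

def digest(value):
    return hashlib.md5(value.encode()).hexdigest()

data = replicas[0][0]              # full data from the first replica
peer_digest = digest(replicas[1][0])  # only a digest from the second

if digest(data) != peer_digest:
    # Digest mismatch: fetch from all replicas and resolve last-write-wins.
    data = max(replicas, key=lambda r: r[1])[0]

assert data == "new"
```

Manu's question amounts to: on mismatch, why contact all three replicas instead of just the two already consulted? Edward's reply is that the extra code path isn't worth it, especially once you must handle a replica that never answers.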
Re: unable to read saved rowcache from disk
http://wiki.apache.org/cassandra/LargeDataSetConsiderations "A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks)." Assuming a row cache of 15MB and an average row of 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries, unless the source table was very large and you can only do a small number of reads/sec.

On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... what do you mean? I think it's only 15MB, which is not big. On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verifies the saved row cache by re-reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provided by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since the data aren't huge, why Cassandra can't read it back? My Cassandra is 1.2.0-beta2.
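Edward's 50,000-entry estimate is rough arithmetic on the numbers from the thread (it treats the 15MB saved file as if it held whole 300-byte rows, which is a deliberate simplification since the file actually holds only keys):

```python
saved_cache_bytes = 15 * 1024 * 1024  # the 15MB saved cache file from the thread
avg_row_bytes = 300                   # assumed average row size

entries = saved_cache_bytes // avg_row_bytes

# On the order of 50K rows to pre-fetch at startup; at even 100 seek-bound
# reads/sec that is under 10 minutes, making the reported 4 hours suspicious.
assert 50_000 < entries < 55_000
```

The point of the estimate is the mismatch: if warming 50K rows takes 4 hours, either the cache holds far more entries than 15MB suggests, or each pre-fetch read is pathologically slow.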
Re: Offsets and Range Queries
There are several reasons. First, there is no absolute offset: the rows are sorted by the data, so if someone inserts new data between your previous query and this query, the rows have changed. Unless you are doing select queries inside a transaction with repeatable read, and your database supports this, the query you mention does not really have absolute offsets either; the results can change between reads. In Cassandra we do not execute large queries (that might result in temp tables or whatever) and then allow you to page through them. Slices have a fixed size; this ensures that the query does not execute for arbitrary lengths of time.

On Thu, Nov 15, 2012 at 6:39 AM, Ravikumar Govindarajan ravikumar.govindara...@gmail.com wrote: Usually we do a SELECT * FROM ORDER BY LIMIT 26,25 for pagination purposes, but specifying an offset is not available for range queries in Cassandra. I always have to specify a start-key to achieve this. Are there reasons for choosing such an approach rather than providing an absolute offset? -- Ravi
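The start-key paging model Ravi describes can be sketched as follows; the data and helper are purely illustrative. Instead of a numeric OFFSET, each page begins just after the last key of the previous page, so a page boundary stays meaningful even as rows are inserted or deleted around it:

```python
# A toy sorted "table": ten rows keyed key000..key009.
rows = {f"key{i:03d}": i for i in range(10)}

def page(start_key, limit):
    """Return up to `limit` keys strictly after start_key (None = from start)."""
    keys = sorted(rows)
    if start_key is not None:
        keys = [k for k in keys if k > start_key]
    return keys[:limit]

p1 = page(None, 4)
p2 = page(p1[-1], 4)   # resume from the last key of the previous page

assert p1 == ["key000", "key001", "key002", "key003"]
assert p2 == ["key004", "key005", "key006", "key007"]
```

With OFFSET-based paging, a row inserted before the boundary would shift every later page; with start-key paging, only the pages containing the new row change.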
Re: unable to read saved rowcache from disk
If the startup is taking a long time or not working and you believe it to be corrupt in some way, it is safe to delete the saved cache files. If you think the process is taking longer than it should, you could try attaching a debugger to the process. I try to avoid the row cache these days; even with cache auto-tuning (which I am not using), one really wide row can cause issues. I like letting the OS disk cache do its thing.

On Thu, Nov 15, 2012 at 2:20 AM, Wz1975 wz1...@yahoo.com wrote: Before shut down, you saw rowcache has 500m, 1.6m rows, each row average 300B, so 700k row should be a little over 200m, unless it is reading more, maybe tombstone? Or the rows on disk have grown for some reason, but row cache was not updated? Could be something else eats up the memory. You may profile memory and see who consumes the memory. Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: 3G, other jvm parameters are unchanged. On Thu, Nov 15, 2012 at 2:40 PM, Wz1975 wz1...@yahoo.com wrote: How big is your heap? Did you change the jvm parameter? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: add a counter and print out myself On Thu, Nov 15, 2012 at 1:51 PM, Wz1975 wz1...@yahoo.com wrote: Curious where did you see this? Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Re: unable to read saved rowcache from disk From: Manu Zhang owenzhang1...@gmail.com To: user@cassandra.apache.org CC: OOM at deserializing 747321th row On Thu, Nov 15, 2012 at 9:08 AM, Manu Zhang owenzhang1...@gmail.com wrote: oh, as for the number of rows, it's 165. How long would you expect it to be read back?
On Thu, Nov 15, 2012 at 3:57 AM, Wei Zhu wz1...@yahoo.com wrote: Good information Edward. For my case, we have good size of RAM (76G) and the heap is 8G. So I set the row cache to be 800M as recommended. Our column is kind of big, so the hit ratio for row cache is around 20%, so according to datastax, might just turn the row cache altogether. Anyway, for restart, it took about 2 minutes to load the row cache INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125 keys) row cache for XXX.f2 Just for comparison, our key is long, the disk usage for row cache is 253K. (it only stores key when row cache is saved to disk, so 253KB/ 8bytes = 31625 number of keys). It's about right... So for 15MB, there could be a lot of narrow rows. (if the key is Long, could be more than 1M rows) Thanks. -Wei From: Edward Capriolo edlinuxg...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, November 13, 2012 11:13 PM Subject: Re: unable to read saved rowcache from disk http://wiki.apache.org/cassandra/LargeDataSetConsiderations A negative side-effect of a large row-cache is start-up time. The periodic saving of the row cache information only saves the keys that are cached; the data has to be pre-fetched on start-up. On a large data set, this is probably going to be seek-bound and the time it takes to warm up the row cache will be linear with respect to the row cache size (assuming sufficiently large amounts of data that the seek bound I/O is not subject to optimization by disks) Assuming a row cache 15MB and the average row is 300 bytes, that could be 50,000 entries. 4 hours seems like a long time to read back 50K entries. Unless the source table was very large and you can only do a small number / reads/sec. On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote: incorrect... 
what do you mean? I think it's only 15MB, which is not big. On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Yes the row cache could be incorrect so on startup cassandra verify they saved row cache by re reading. It takes a long time so do not save a big row cache. On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote: I have a rowcache provieded by SerializingCacheProvider. The data that has been read into it is about 500MB, as claimed by jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose the size from jconsole is before serializing. Now while restarting Cassandra, it's unable to read saved rowcache back. By unable, I mean around 4 hours and I have to abort it and remove cache so as not to suspend other tasks. Since
Re: Question regarding the need to run nodetool repair
On Thursday, November 15, 2012, Dwight Smith dwight.sm...@genesyslab.com wrote: I have a 4 node cluster, version 1.1.2, replication factor of 4, read/write consistency of 3, level compaction. Several questions.

1) Should nodetool repair be run regularly to assure it has completed before gc_grace? If it is not run, what are the exposures?

Yes. Lost tombstones could cause deleted data to reappear.

2) If a node goes down, and is brought back up prior to the 1 hour hinted handoff expiration, should repair be run immediately?

If the node is brought up prior to 1 hour, you should let the hints replay. Repair is always safe to run.

3) If the hinted handoff has expired, the plan is to remove the node and start a fresh node in its place. Does this approach cause problems?

You only need to join a fresh node if the node was down longer than gc_grace. Default is 10 days.

Thanks

If you read and write at quorum and run repair regularly you can worry less about the things above because they are essentially non-factors.
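The relationship between the repair schedule and gc_grace in answer 1) reduces to a simple inequality: every node must be repaired at least once inside each gc_grace window, or tombstones can be purged on some replicas before others learn of the delete.

```python
gc_grace_seconds = 864000      # Cassandra's default gc_grace: 10 days
repair_interval_days = 7       # a common operational choice, not a mandated value

# The repair cycle must complete comfortably inside the gc_grace window;
# otherwise purged tombstones let deleted data "resurrect" via un-repaired replicas.
assert repair_interval_days * 86400 < gc_grace_seconds
```

A weekly repair against the default 10-day gc_grace leaves a 3-day margin for repairs that run long or need to be rescheduled.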
Re: Admin for cassandra?
We should build an eclipse plugin named Eclipsandra or something. On Thu, Nov 15, 2012 at 9:45 PM, Wz1975 wz1...@yahoo.com wrote: Cqlsh is probably the closest you will get. Or pay big bucks to hire someone to develop one for you:) Thanks. -Wei Sent from my Samsung smartphone on ATT Original message Subject: Admin for cassandra? From: Kevin Burton rkevinbur...@charter.net To: user@cassandra.apache.org CC: Is there an IDE for a Cassandra database? Similar to the SQL Server Management Studio for SQL server. I mainly want to execute queries and see the results. Preferably that runs under a Windows OS. Thank you.
Re: Collections, query for contains?
This was my first question after I got the inserts working. Hive has UDFs like array_contains. It also has LATERAL VIEW syntax that is similar to a transpose.

On Monday, November 19, 2012, Timmy Turner timm.t...@gmail.com wrote: Is there no option to query for the contents of a collection? Something like select * from cf where c_list contains('some_value') or select * from cf where c_map contains('some_key') or select * from cf where c_map['some_key'] contains('some_value')
Re: SchemaDisagreementException
even if you made the calls through cql you would have the same issue since cql uses thrift. 1.2.0 is supposed to be nicer with concurrent modifications. On Monday, November 19, 2012, Everton Lima peitin.inu...@gmail.com wrote: I was using cassandra directly because it has more performance than using CQL. Therefore, I am using cassandra because of replication factor and consistency of data. I am using it as a lib of my app. I only make simple queries, just using a key to point to the data. 2012/11/16 Everton Lima peitin.inu...@gmail.com I do that because I need to create dynamic column families. I create 2 keyspaces at the start of the application, using an embedded cassandra instance too, but it never throws an exception. And then, I insert dynamic column families in these 2 keyspaces. I put a Thread.sleep(3000); in the middle of the column family creation code. int waitTime = 3000; logger.info("Waiting " + (waitTime/1000) + "s for synchronizing ..."); Thread.sleep(waitTime); CassandraHelper.createColumnFamily(CassandraHelper.KEYSPACE, layer); logger.info("Waiting " + (waitTime/1000) + "s for synchronizing ..."); Thread.sleep(waitTime); I do that because in the code of CassandraStress, after creating a column family, it does that too. Is it a wrong or a good solution? Any other idea? 2012/11/14 aaron morton aa...@thelastpickle.com Out of interest, why are you creating column families by making direct calls on an embedded cassandra instance? I would guess your life would be easier if you defined a schema in CQL or CLI. I already read in the documentation that this error occurs when more than one thread/processor accesses the same place in Cassandra, but I think this is not occurring. How many nodes do you have? I am using 3 nodes. What version are you running? The version is 1.1.6. It sounds like you have run simultaneous schema updates and the global schema has diverged. If you can create your schema in CLI or CQL I would recommend doing that.
If you are trying to do something more complicated you'll need to provide more information. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/11/2012, at 3:13 AM, Everton Lima peitin.inu...@gmail.com wrote: Some times, when I try to insert a data in Cassandra with Method: static void createColumnFamily(String keySpace, String columnFamily){ synchronized (mutex){ Iface cs = new CassandraServer(); CfDef cfDef = new CfDef(keySpace, colu
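Rather than the fixed Thread.sleep(3000) discussed above, a more robust pattern is to poll until the cluster reports a single schema version (Thrift exposes this as describe_schema_versions, which groups unreachable nodes under the pseudo-version "UNREACHABLE"). A sketch only; the Supplier is a stand-in for the real Thrift call, which needs a live cluster:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: poll a schema-version map (shaped like the result of Thrift's
// describe_schema_versions) until all reachable nodes report one schema
// version. The Supplier is a stand-in for the real call, not Cassandra API.
class SchemaAgreement {
    static boolean waitForAgreement(Supplier<Map<String, List<String>>> fetchVersions,
                                    long timeoutMs, long pollMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            Map<String, List<String>> versions = fetchVersions.get();
            // Unreachable nodes are reported under a pseudo-version; ignore them.
            long liveVersions = versions.keySet().stream()
                    .filter(v -> !"UNREACHABLE".equals(v))
                    .count();
            if (liveVersions == 1) {
                return true; // every reachable node agrees on one schema
            }
            if (System.currentTimeMillis() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The caller would invoke this after each createColumnFamily instead of sleeping a fixed interval.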
Re: SchemaDisagreementException
http://www.acunu.com/2/post/2011/12/cql-benchmarking.html Last I checked, thrift still had an edge over cql due to string serialization and deserialization. Might be even more dramatic for larger columns. Not that client speed matters much overall in cassandra's speed, but the CQL client does more. On Mon, Nov 19, 2012 at 9:27 PM, Michael Kjellman mkjell...@barracuda.com wrote: While this might not be helpful (I don't have all the thread history here), have you checked that all your servers are properly synced with NTP? From: Everton Lima peitin.inu...@gmail.com Reply-To: user@cassandra.apache.org user@cassandra.apache.org Date: Monday, November 19, 2012 6:24 PM To: user@cassandra.apache.org user@cassandra.apache.org Subject: Re: SchemaDisagreementException Yes, I have already tested it. I use the object CassandraServer to do the operations instead of opening a connection with CassandraClient. Both of these objects implement Iface. I think the performance of using CassandraServer improves because it does not open a connection, while CassandraClient (which uses thrift) and CQL do open a connection. 2012/11/19 Tyler Hobbs ty...@datastax.com Have you actually tested to see that the Thrift API is more performant than CQL for your application? As far as I know, CQL almost always has a performance advantage over the Thrift API. On Mon, Nov 19, 2012 at 1:05 PM, Everton Lima peitin.inu...@gmail.com wrote: For some reason I can not reply to my old thread on this list, so I am creating a new one. The problem is that I do not use thrift, in order to gain performance. Why is it nicer with concurrent modifications? I do not know why I have run into the problem of concurrent modification if I was creating 2 different keyspaces in only one process with just one thread. Does someone know why?
-- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás -- Tyler Hobbs DataStax -- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás -- 'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook
Re: Query regarding SSTable timestamps and counts
On Tue, Nov 20, 2012 at 5:23 PM, aaron morton aa...@thelastpickle.com wrote: My understanding of the compaction process was that since data files keep continuously merging we should not have data files with very old last modified timestamps It is perfectly OK to have very old SSTables. But performing an upgradesstables did decrease the number of data files and removed all the data files with the old timestamps. upgradesstables rewrites every sstable to have the same contents in the newest format. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 19/11/2012, at 4:57 PM, Ananth Gundabattula agundabatt...@gmail.com wrote: Hello Aaron, Thanks a lot for the reply. Looks like the documentation is confusing. Here is the link I am referring to: http://www.datastax.com/docs/1.1/operations/tuning#tuning-compaction It does not disable compaction. As per the above url, "After running a major compaction, automatic minor compactions are no longer triggered, frequently requiring you to manually run major compactions on a routine basis." (Just before the heading Tuning Column Family compression in the above link) With respect to the replies below: it creates one big file, which will not be compacted until there are (by default) 3 other very big files. This is for minor compaction; a major compaction should theoretically result in one large file irrespective of the number of data files initially? This is not something you have to worry about. Unless you are seeing 1,000's of files using the default compaction. Well, my worry has been because of the large amount of node movement we have done in the ring. We started off with 6 nodes and increased the capacity to 12 with disproportionate increases every time, which resulted in a lot of cleaning of data folders (except system), running repair and then a cleanup, with an aborted attempt in between.
There were some data.db files older than 2 weeks that were not modified since then. My understanding of the compaction process was that since data files keep continuously merging we should not have data files with very old last modified timestamps (assuming there is a good amount of writes to the table continuously). I did not have a sure way of telling if everything is alright with the compaction by looking at the last modified timestamps of all the data.db files. What are the compaction issues you are having? Your replies confirm that the timestamps should not be an issue to worry about. So I guess I should not be calling them issues any more. But performing an upgradesstables did decrease the number of data files and removed all the data files with the old timestamps. Regards, Ananth On Mon, Nov 19, 2012 at 6:54 AM, aaron morton aa...@thelastpickle.com wrote: As per datastax documentation, a manual compaction forces the admin to start compaction manually and disables the automated compaction (at least for major compactions but not minor compactions) It does not disable compaction. it creates one big file, which will not be compacted until there are (by default) 3 other very big files. 1. Does a nodetool stop compaction also force the admin to manually run major compaction (i.e. disable automated major compactions?) No. Stop just stops the current compaction. Nothing is disabled. 2. Can a node restart reset the automated major compaction if a node gets into a manual mode compaction for whatever reason? Major compaction is not automatic. It is the manual nodetool compact command. Automatic (minor) compaction is controlled by min_compaction_threshold and max_compaction_threshold (for the default compaction strategy). 3. What is the ideal number of SSTables for a table in a keyspace (I mean, are there any indicators as to whether my compaction is alright or not?) This is not something you have to worry about.
Unless you are seeing 1,000's of files using the default compaction. For example, I have seen SSTables on the disk more than 10 days old wherein there were other SSTables belonging to the same table but much younger than the older SSTables. No problems. 4. Does an upgradesstables fix any compaction issues? What are the compaction issues you are having? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/11/2012, at 1:18 AM, Ananth Gundabattula agundabatt...@gmail.com wrote: We have a cluster running cassandra 1.1.4. On this cluster, 1. We had to move the nodes around a bit when we were adding new nodes (there was quite a good amount of node movement) 2. We had to stop compactions during some of the days to save some disk space on some of the nodes when they were running very low on disk space. (via nodetool stop COMPACTION) As per datastax documentation, a manual
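The thresholds Aaron mentions can be inspected and tuned per column family with nodetool; the keyspace and column family names below are placeholders:

```
# inspect and adjust the size-tiered min/max thresholds
nodetool -h localhost getcompactionthreshold MyKeyspace MyCF
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 4 32
# rewrite every sstable into the newest format
# (the operation that removed the old-timestamped files in this thread)
nodetool -h localhost upgradesstables MyKeyspace MyCF
```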
Re: Other problem in update
I am just taking a stab at this one. UUIDs interact with system time and maybe your real-time OS is doing something funky there. The other option, which seems more likely, is that your unit tests are not cleaning up their data directory and there is some corrupt data in there. On Tue, Nov 27, 2012 at 7:40 AM, Everton Lima peitin.inu...@gmail.com wrote: People, when I try to execute my program that uses EmbeddedCassandraService, with version 1.1.2 of cassandra on the OpenSuse Real Time operating system, it throws the following exception: [27/11/12 10:27:28,314 BRST] ERROR service.CassandraDaemon: Exception in thread Thread[MutationStage:20,5,main] java.lang.NullPointerException at org.apache.cassandra.utils.UUIDGen.decompose(UUIDGen.java:96) at org.apache.cassandra.cql.jdbc.JdbcUUID.decompose(JdbcUUID.java:55) at org.apache.cassandra.db.marshal.UUIDType.decompose(UUIDType.java:187) at org.apache.cassandra.db.RowMutation.hintFor(RowMutation.java:107) at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:582) at org.apache.cassandra.service.StorageProxy$5.runMayThrow(StorageProxy.java:557) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) When I try to execute the same program in Ubuntu 12.04 the program starts without ERRORS. Could someone help me? -- Everton Lima Aleixo Bacharel em Ciencia da Computação Universidade Federal de Goiás
Re: Java high-level client
Hector does not require an outdated version of thrift; you are likely using an outdated version of hector. Here is the long and short of it: If the Thrift API changes then hector can have compatibility issues. This happens from time to time. The main methods like get() and insert() have remained the same, but the CFMetaData objects have changed. (This causes the incompatible class stuff you are seeing.) CQL has a different version of the same problem: the CQL syntax is versioned. For example, if you try to execute a CQL3 query as a CQL2 query it will likely fail. In the end your code still has to be version aware. With hector you get a compile time problem, with pure CQL you get a runtime problem. I have always had the opinion the project should have shipped hector with Cassandra; this would have made it obvious what version is likely to work. The new CQL transport client is not being shipped with Cassandra either, so you will still have to match up the versions. Although they should be largely compatible, some time in the near or far future one of the clients probably won't work with one of the servers. Edward On Tue, Nov 27, 2012 at 11:10 AM, Michael Kjellman mkjell...@barracuda.com wrote: Netflix has a great client https://github.com/Netflix/astyanax On 11/27/12 7:40 AM, Peter Lin wool...@gmail.com wrote: I use hector-client master, which is pretty stable right now. It uses the latest thrift, so you can use hector with thrift 0.9.0. That's assuming you don't mind using the active development branch. On Tue, Nov 27, 2012 at 10:36 AM, Carsten Schnober schno...@ids-mannheim.de wrote: Hi, I'm aware that this has been a frequent question, but answers are still hard to find: what's an appropriate Java high-level client? I actually believe that the lack of a single maintained Java API that is packaged with Cassandra is quite an issue.
The way the situation is right now, new users have to pick more or less randomly one of the available options from the Cassandra Wiki and find a suitable solution for their individual requirements through trial implementations. This can cause a lot of wasted time (and frustration). Personally, I've played with Hector before figuring out that it seems to require an outdated Thrift version. Downgrading to Thrift 0.6 is not an option for me though because I use Thrift 0.9.0 in other classes of the same project. So I've had a look at Kundera and at Easy-Cassandra. Both seem to lack real documentation beyond the examples available in their Github repositories, right? Can more experienced users recommend either one of the two or some of the other options listed at the Cassandra Wiki? I know that this strongly depends on individual requirements, but all I need are simple requests for very basic queries. So I would like to emphasize the importance of clear documentation and a stable and well-maintained API. Any hints? Thanks! Carsten -- Institut für Deutsche Sprache | http://www.ids-mannheim.de Projekt KorAP | http://korap.ids-mannheim.de Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de Korpusanalyseplattform der nächsten Generation Next Generation Corpus Analysis Platform
Re: counters + replication = awful performance?
The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.li...@gmail.com wrote: Hi Juan, thanks for your input! In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes. It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2, the performance goes down the drain. This happens in my tests using both 1.0.x and 1.1.6.
Let me elaborate: I have two boxes (virtual servers on top of physical servers rented specifically for this purpose, i.e. it's not a cloud, nor it is shared; virtual servers are managed by our admins as a way to limit damage as I suppose :)). Cassandra partitioner is set to ByteOrderedPartitioner because I want to be able to do some range queries. First, I set up Cassandra individually on each box (not in a cluster) and test counter increments performance (exclusively increments, no reads). For tests I use code that is intended to somewhat resemble the expected load pattern -- particularly the majority of increments create new counters with some updating (adding) to already existing counters. In this test each single node exhibits respectable performance - something on the order of 70k (seventy thousand) increments per second. I then join both of these nodes into single cluster (using SimpleSnitch and SimpleStrategy, nothing fancy yet). I then run the same test using replication_factor=1. The performance is on the order of 120k increments per second -- which seems to be a reasonable increase over the single node performance. HOWEVER I then rerun the same test on the two-node cluster using replication_factor=2 -- which is the least I'll need for actual production for redundancy purposes. And the performance I get is absolutely horrible -- much, MUCH worse than even single-node performance -- something on the order of less than 25k increments per second. In addition to clients not being able to push updates fast enough, I also see a lot of 'messages dropped' messages in the Cassandra log under this load. Could anyone advise what could be causing such drastic performance drop under replication_factor=2? I was expecting something on the order of single-node performance, not approximately 3x less. When testing replication_factor=2 on 1.1.6 I can see that CPU usage goes through the roof. 
On 1.0.x I think it looked more like disk overload, but I'm not sure (being on a virtual server I apparently can't see true iostats). I do have Cassandra data on a separate disk; commit log and cache are currently on the same disk as the system. I experimented with commit log flush modes and even with disabling the commit log altogether -- but it doesn't seem to have noticeable impact on the performance when under replication_factor=2. Any suggestions and hints will be much appreciated :) And please let me know if I need to share additional information about the configuration I'm running on. Best regards, Sergey -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993.html Sent from the cassandra-user@.apache mailing list archive at Nabble.com.
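A crude way to sanity-check the RF=1 vs RF=2 numbers in this thread: each client op fans out to RF replica writes spread over the cluster, so on two nodes at RF=2 every node takes every write and cluster throughput should roughly halve relative to RF=1, not drop more than 4x. The toy model below is an assumption of mine, not from the thread, and ignores the replica-side read that counter increments add:

```java
// Toy fanout model: per-node replica writes for a given client op rate.
// Assumption: ignores counter read-before-write and coordination overhead.
class FanoutModel {
    static double replicaWritesPerNode(double clientOpsPerSec, int rf, int nodes) {
        return clientOpsPerSec * rf / nodes;
    }
}
```

Under this model, the observed 120k ops/s at RF=1 costs each node ~60k replica writes/s, so RF=2 should still sustain roughly 60-70k client ops/s rather than the observed ~25k, which is why the thread suspects counters specifically.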
Re: selective replication of keyspaces
You can do something like this: Divide your nodes up into 4 datacenters art1,art2,art3,core [default@unknown] create keyspace art1 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art1:2,core:2}]; [default@unknown] create keyspace art2 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art2:2,core:2}]; [default@unknown] create keyspace art3 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art3:2,core:2}]; [default@unknown] create keyspace core placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{core:2}]; On Tue, Nov 27, 2012 at 5:02 PM, Artist jer...@simpleartmarketing.com wrote: I have 3 art-servers, each with a cassandra cluster. Each of the art-servers has config/state information stored in keyspaces respectively called art-server-1-current-state, art-server-2-current-state, art-server-3-current-state. In my core server I have a separate Cassandra cluster. I would like to use Cassandra to replicate the current-state of each art-server on the core cassandra server without sharing that information with any of the art-servers. Is there a way to replicate the keyspaces to a single Cassandra cluster (my core) without having any peer sharing between the 3 art-servers. - Artist -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/selective-replication-of-keyspaces-tp7584007.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
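For the datacenter names (art1..core) in those strategy_options to mean anything, the snitch has to place each node into the matching datacenter; with PropertyFileSnitch that is cassandra-topology.properties on every node. The addresses below are illustrative only:

```
# cassandra-topology.properties (illustrative addresses) -- ip=DC:rack
10.0.1.1=art1:RAC1
10.0.2.1=art2:RAC1
10.0.3.1=art3:RAC1
10.0.4.1=core:RAC1
10.0.4.2=core:RAC1
default=core:RAC1
```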
Re: counters + replication = awful performance?
I misspoke, really. It is not dangerous; you just have to understand what it means. This jira discusses it. https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay sco...@mailchannels.com wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.li...@gmail.com wrote: Hi Juan, thanks for your input! In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes.
It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2, the performance goes down the drain. This happens in my tests using both 1.0.x and 1.1.6. Let me elaborate: I have two boxes (virtual servers on top of physical servers rented specifically for this purpose, i.e. it's not a cloud, nor is it shared; virtual servers are managed by our admins as a way to limit damage as I suppose :)). Cassandra partitioner is set to ByteOrderedPartitioner because I want to be able to do some range queries. First, I set up Cassandra individually on each box (not in a cluster) and test counter increments performance (exclusively increments, no reads). For tests I use code that is intended to somewhat resemble the expected load pattern -- particularly the majority of increments create new counters with some updating (adding) to already existing counters. In this test each single node exhibits respectable performance - something on the order of 70k (seventy thousand) increments per second. I then join both of these nodes into a single cluster (using SimpleSnitch and SimpleStrategy, nothing fancy yet). I then run the same test using replication_factor=1. The performance is on the order of 120k increments per second -- which seems to be a reasonable increase over the single node performance. HOWEVER I then rerun the same test on the two-node cluster using replication_factor=2 -- which is the least I'll need for actual production for redundancy purposes.
And the performance I get is absolutely horrible -- much, MUCH worse than even single-node performance -- something on the order of less than 25k increments per second. In addition to clients not being able to push updates fast enough, I also see a lot of 'messages dropped' messages in the Cassandra log under this load. Could anyone advise what could be causing such drastic performance drop under replication_factor=2? I was expecting something on the order of single-node performance, not approximately 3x less. When testing replication_factor=2 on 1.1.6 I can see that CPU usage goes through the roof. On 1.0.x I think it looked more like disk overload, but I'm not sure (being on virtual server I apparently can't see true iostats). I do have Cassandra data on a separate disk, commit log and cache are currently on the same disk as the system. I experimented with commit log flush modes and even with disabling commit log at all -- but it doesn't seem to have noticeable impact on the performance when under replication_factor=2. Any suggestions and hints will be much
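For reference, the replicate_on_write attribute under discussion is set per column family; in cassandra-cli that looks like the following (keyspace and column family names are placeholders, and as the JIRA issue above notes, think carefully before turning it off):

```
[default@MyKeyspace] update column family counters with replicate_on_write = false;
```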
Re: counters + replication = awful performance?
Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's rainbird did just that. It avoided multiple counter increments by batching them. I have done a similar thing using cassandra and Kafka. https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: Hi, thanks for your suggestions. Regarding replicate=2 vs replicate=1 performance: I expected that the below configurations would have similar performance: - single node, replicate = 1 - two nodes, replicate = 2 (okay, this probably should be a bit slower due to additional overhead). However what I'm seeing is that the second option (replicate=2) is about THREE times slower than single node. Regarding replicate_on_write -- it is, in fact, a dangerous option. As the JIRA discusses, if you make changes to your ring (moving tokens and such) you will *silently* lose data. That is on top of whatever data you might end up losing if you run replicate_on_write=false and the only node that got the data fails. But what is much worse -- with replicate_on_write being false the data will NOT be replicated (in my tests) ever unless you explicitly request the cell. Then it will return the wrong result. And only on subsequent reads it will return adequate results. I haven't tested it, but documentation states that a range query will NOT do 'read repair' and thus will not force replication.
The test I did went like this: - replicate_on_write = false - write something to node A (which should in theory replicate to node B) - wait for a long time (longest was on the order of 5 hours) - read from node B (and here I was getting null / wrong result) - read from node B again (here you get what you'd expect after read repair) In essence, using replicate_on_write=false with rarely read data will practically defeat the purpose of having replication in the first place (failover, data redundancy). Or, in other words, this option doesn't look to be applicable to my situation. It looks like I will get much better performance by simply writing to two separate clusters rather than using a single cluster with replicate=2. Which is kind of stupid :) I think something's fishy with counters and replication. Edward Capriolo wrote I misspoke, really. It is not dangerous; you just have to understand what it means. This jira discusses it. https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay <scottm@...> wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjellman@ wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir <solf.lists@...> wrote: Hi Juan, thanks for your input!
In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be quite sufficient since they are both in the same DC. But it is something to check, thanks! Best regards, Sergey Juan Valencia wrote Hi Sergey, I know I've had similar issues with counters which were bottle-necked by network throughput. You might be seeing a problem with throughput between the clients and Cass or between the two Cass nodes. It might not be your case, but that was what happened to me :-) Juan On Tue, Nov 27, 2012 at 8:48 AM, Sergey Olefir <solf.lists@...> wrote: Hi, I have a serious problem with counters performance and I can't seem to figure it out. Basically I'm building a system for accumulating some statistics on the fly via Cassandra distributed counters. For this I need counter updates to work really fast and herein lies my problem -- as soon as I enable replication_factor = 2
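Edward's rainbird-style batching suggestion above can be sketched as a small client-side coalescer: accumulate deltas in memory and issue one counter update per key per flush interval, instead of n updates for n events. The class and names below are illustrative only, not a real Cassandra client API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of client-side counter batching (rainbird-style): coalesce many
// +1s to the same counter in memory, then flush one increment per counter.
// Illustrative only; not a real Cassandra client API.
class CounterBatcher {
    private final Map<String, Long> pending = new HashMap<>();

    void increment(String counter, long delta) {
        pending.merge(counter, delta, Long::sum);
    }

    // Returns the coalesced deltas and clears the buffer; a real
    // implementation would issue one counter update per entry here.
    Map<String, Long> flush() {
        Map<String, Long> out = new HashMap<>(pending);
        pending.clear();
        return out;
    }
}
```

A periodic flush (say, once per second) trades a small window of potential loss on client crash for far fewer replicated counter writes.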
Re: counters + replication = awful performance?
By the way, the other issues you are seeing with replicate_on_write = false could be because you did not repair. You should do that when changing rf. On Tuesday, November 27, 2012, Edward Capriolo edlinuxg...@gmail.com wrote: Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's rainbird did just that. It avoided multiple counter increments by batching them. I have done a similar thing using cassandra and Kafka. https://github.com/edwardcapriolo/IronCount/blob/master/src/test/java/com/jointhegrid/ironcount/mockingbird/MockingBirdMessageHandler.java On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: Hi, thanks for your suggestions. Regarding replicate=2 vs replicate=1 performance: I expected that the below configurations would have similar performance: - single node, replicate = 1 - two nodes, replicate = 2 (okay, this probably should be a bit slower due to additional overhead). However what I'm seeing is that the second option (replicate=2) is about THREE times slower than single node. Regarding replicate_on_write -- it is, in fact, a dangerous option. As the JIRA discusses, if you make changes to your ring (moving tokens and such) you will *silently* lose data. That is on top of whatever data you might end up losing if you run replicate_on_write=false and the only node that got the data fails. But what is much worse -- with replicate_on_write being false the data will NOT be replicated (in my tests) ever unless you explicitly request the cell. Then it will return the wrong result. And only on subsequent reads it will return adequate results. I haven't tested it, but documentation states that a range query will NOT do 'read repair' and thus will not force replication.
The test I did went like this: - replicate_on_write = false - write something to node A (which should in theory replicate to node B) - wait for a long time (longest was on the order of 5 hours) - read from node B (and here I was getting null / wrong result) - read from node B again (here you get what you'd expect after read repair) In essence, using replicate_on_write=false with rarely read data will practically defeat the purpose of having replication in the first place (failover, data redundancy). Or, in other words, this option doesn't look to be applicable to my situation. It looks like I will get much better performance by simply writing to two separate clusters rather than using a single cluster with replicate=2. Which is kind of stupid :) I think something's fishy with counters and replication. Edward Capriolo wrote: I misspoke really. It is not dangerous, you just have to understand what it means. This JIRA discusses it: https://issues.apache.org/jira/browse/CASSANDRA-3868 On Tue, Nov 27, 2012 at 6:13 PM, Scott McKay scottm@... wrote: We're having a similar performance problem. Setting 'replicate_on_write: false' fixes the performance issue in our tests. How dangerous is it? What exactly could go wrong? On 12-11-27 01:44 PM, Edward Capriolo wrote: The difference between replication factor = 1 and replication factor > 1 is significant. Also it sounds like your cluster is 2 nodes, so going from RF=1 to RF=2 means double the load on both nodes. You may want to experiment with the very dangerous column family attribute: - replicate_on_write: Replicate every counter update from the leader to the follower replicas. Accepts the values true and false. Edward On Tue, Nov 27, 2012 at 1:02 PM, Michael Kjellman mkjellman@... wrote: Are you writing with QUORUM consistency or ONE? On 11/27/12 9:52 AM, Sergey Olefir solf.lists@... wrote: Hi Juan, thanks for your input!
In my case, however, I doubt this is the case -- clients are able to push many more updates than I need to saturate the replication_factor=2 case (e.g. I'm doing as many as 6x more increments when testing a 2-node cluster with replication_factor=1), so bandwidth between clients and server should be sufficient. Bandwidth between nodes in the cluster should also be … -- *Scott McKay*, Sr. Software Developer MailChannels Tel: +1 604 685 7488 x 509 www.mailchannels.com -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/counters-replication-awful-performance-tp7583993p7584011.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
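Edward's batching suggestion can be put in a few lines. This is an illustrative sketch of the Rainbird/IronCount idea (accumulate deltas in memory, emit one increment per key per flush interval), not code from either project; `CounterBatcher` and `flush_fn` are invented names:

```python
from collections import defaultdict

class CounterBatcher:
    """Accumulate counter deltas in memory and flush them as one
    increment per key, so N application-level increments become a
    single Cassandra counter update (and a single read-on-increment)."""

    def __init__(self, flush_fn):
        # flush_fn(key, delta) would issue the actual counter update.
        self.flush_fn = flush_fn
        self.pending = defaultdict(int)

    def increment(self, key, delta=1):
        self.pending[key] += delta  # in-memory only, no I/O

    def flush(self):
        # Swap the pending map out, then emit one update per key.
        batch, self.pending = self.pending, defaultdict(int)
        for key, delta in batch.items():
            self.flush_fn(key, delta)
        return len(batch)
```

With this shape, 1,000 increments of one hot key cost one counter update per flush interval instead of 1,000, at the price of losing up to one interval of deltas if the process dies before flushing.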
Re: counters + replication = awful performance?
Say you are doing 100 inserts at RF=1 on two nodes. That is 50 inserts a node. If you go to RF=2 that is 100 inserts a node. If you were at 75% capacity on each node, you're now at 150%, which is not possible, so things bog down. To figure out what is going on we would need to see tpstats, iostat, and top information. I think you're looking at the performance the wrong way. Starting off at RF=1 is not the way to understand Cassandra performance. The benefits of scale-out do not appear until you fix your RF and increase your node count, i.e. 5 nodes at RF=3 is fast, 10 nodes at RF=3 even better. On Tuesday, November 27, 2012, Sergey Olefir solf.li...@gmail.com wrote: I already do a lot of in-memory aggregation before writing to Cassandra. The question here is what is wrong with Cassandra (or its configuration) that causes a huge performance drop when moving from 1-replication to 2-replication for counters -- and more importantly how to resolve the problem. A 2x-3x drop when moving from 1-replication to 2-replication on two nodes is reasonable. 6x is not. Like I said, with this kind of performance degradation it makes more sense to run two clusters with replication=1 in parallel rather than rely on Cassandra replication. And yes, Rainbird was the inspiration for what we are trying to do here :) Edward Capriolo wrote: Cassandra's counters read on increment. Additionally they are distributed, so there can be multiple reads on increment. If they are not fast enough and you have avoided all tuning options, add more servers to handle the load. In many cases incrementing the same counter n times can be avoided. Twitter's Rainbird did just that: it avoided multiple counter increments by batching them. I have done a similar thing using Cassandra and Kafka.
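The per-node load arithmetic Edward walks through (100 inserts at RF=1 on two nodes is 50 per node; RF=2 makes it 100 per node) reduces to one line; a minimal sketch, with an invented function name:

```python
def writes_per_node(total_writes, rf, nodes):
    """Each write is stored on `rf` replicas, so the per-node
    write load is total_writes * rf / nodes."""
    return total_writes * rf / nodes

# Two nodes, 100 writes/s total:
#   RF=1 -> 50 writes/s per node
#   RF=2 -> 100 writes/s per node (every node holds every write)
```

This is why a 2-node RF=2 cluster cannot be expected to match the throughput of a single RF=1 node: raising RF at fixed node count raises per-node load proportionally.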
Re: selective replication of keyspaces
My mistake, that is older CLI syntax; I was just showing the concept: set up 4 datacenters and selectively replicate keyspaces between them. On Tuesday, November 27, 2012, jer...@simpleartmarketing.com wrote: Thank you. This is a good start; I was beginning to think it couldn't be done. When I run the command I get the error syntax error at position 21: missing EOF at 'placement_strategy' -- that is probably because I still need to set the correct properties in the conf files On November 27, 2012 at 5:41 PM Edward Capriolo edlinuxg...@gmail.com wrote: You can do something like this: Divide your nodes up into 4 datacenters: art1, art2, art3, core [default@unknown] create keyspace art1 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art1:2,core:2}]; [default@unknown] create keyspace art2 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art2:2,core:2}]; [default@unknown] create keyspace art3 placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{art3:2,core:2}]; [default@unknown] create keyspace core placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options=[{core:2}]; On Tue, Nov 27, 2012 at 5:02 PM, Artist jer...@simpleartmarketing.com wrote: I have 3 art-servers, each of which has a Cassandra cluster. Each of the art-servers has config/state information stored in keyspaces respectively called art-server-1-current-state, art-server-2-current-state, art-server-3-current-state. In my core server I have a separate Cassandra cluster. I would like to use Cassandra to replicate the current-state of each art-server on the core Cassandra server without sharing that information with any of the art-servers. Is there a way to replicate the keyspaces to a single Cassandra cluster (my core) without having any peer sharing between the 3 art-servers?
- Artist -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/selective-replication-of-keyspaces-tp7584007.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Generic questions over Cassandra 1.1/1.2
@Bill Are you saying that now Cassandra is less schema-less? :) Compact storage is the schemaless of old. On Tuesday, November 27, 2012, Bill de hÓra b...@dehora.net wrote: I'm not sure I always understand what people mean by 'schema less' exactly and I'm curious. For 'schema less', given this - {{{ cqlsh use example; cqlsh:example CREATE TABLE users ( ... user_name varchar, ... password varchar, ... gender varchar, ... session_token varchar, ... state varchar, ... birth_year bigint, ... PRIMARY KEY (user_name) ... ); }}} I expect this would not cause an unknown identifier error - {{{ INSERT INTO users (user_name, password, extra, moar) VALUES ('bob', 'secret', 'a', 'b'); }}} but definitions vary. Bill On 26/11/12 09:18, Sylvain Lebresne wrote: On Mon, Nov 26, 2012 at 8:41 AM, aaron morton aa...@thelastpickle.com wrote: Is there any noticeable performance difference between Thrift and CQL3? Off the top of my head it's within 5% (maybe 10%) under stress tests. See Eric's talk at the Cassandra SF conference for the exact numbers. Eric's benchmark results were that normal queries were slightly slower but prepared ones (and in real life, I see no good reason not to prepare statements) were actually slightly faster. CQL 3 requires a schema; however, altering the schema is easier, and in 1.2 it will support concurrent schema modifications. The Thrift API is still schema-less. Sorry to hijack this thread, but I'd be curious (like seriously, I'm not trolling) to understand what you mean by 'CQL 3 requires a schema but the Thrift API is still schema less'. Basically I'm not sure I always understand what people mean by schema less exactly and I'm curious. -- Sylvain
Re: counters + replication = awful performance?
I may be wrong, but during a bootstrap hints can be silently discarded if the node they are destined for leaves the ring. There are a large number of people using counters for 5-minute real-time statistics. On the back end they use ETL-based reporting to compute the true stats at an hourly or daily interval. A user like this might benefit from DANGER counters. They are not looking for perfection, only better performance, and the counter row keys themselves roll over in 5 minutes anyway. Options like this are also great for winning benchmarks. When some other NoSQL system (that is not as fast as C*) wants to win a benchmark, they turn off the WAL, or write acks, or something else that compromises their ACID/CAP story for the purpose of winning. We need our own secret awesome-sauce dangerous options too! jk On Wed, Nov 28, 2012 at 4:21 AM, Rob Coli rc...@palominodb.com wrote: On Tue, Nov 27, 2012 at 3:21 PM, Edward Capriolo edlinuxg...@gmail.com wrote: I misspoke really. It is not dangerous, you just have to understand what it means. This JIRA discusses it: https://issues.apache.org/jira/browse/CASSANDRA-3868 Per Sylvain on the referenced ticket: I don't disagree about the efficiency of the valve, but at what price? 'Bootstrapping a node will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong)' is a pretty bad drawback. That is pretty much why that option makes me uncomfortable: it does give you better performance, so people may be tempted to use it. Now if it was only a matter of replicating writes only through read-repair/repair, then ok, it's pretty dangerous but it's rather easy to explain/understand the drawback (if you don't lose a disk, you don't lose increments, and you'd better use CL.ALL or have read_repair_chance to 1). But the fact that it doesn't work with bootstrap/move makes me wonder if having the option at all is not making a disservice to users.
To me anything that can be described as will make you lose increments (you don't know which ones, you don't know how many and this even if nothing goes wrong) and which therefore doesn't work with bootstrap/move is correctly described as dangerous. :D =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: counters + replication = awful performance?
Just for reference, HBase's counters also do a local read. I am not saying they work better/worse/faster/slower, but I would not expect any system that reads on increment to be significantly faster than what Cassandra does. Just saying: your counter throughput is read-bound, and this is not unique to C*'s implementation. On Wed, Nov 28, 2012 at 2:41 PM, Sergey Olefir solf.li...@gmail.com wrote: Well, that is sad news then. I don't think I can consider 20k increments per second for a two-node cluster (with RF=2) reasonable performance (cost vs. benefit). I might have to look into other storage solutions or perhaps experiment with duplicate clusters with RF=1 or replicate_on_write=false. Although yes, I probably should try that row cache you mentioned -- I saw that the key cache was going unused (so saw no reason to try to enable the row cache), but I think that was on RF=1; it might be different on RF=2. Sylvain Lebresne-3 wrote: Counter replication works in a different way than that of normal writes. Namely, a counter update is written to a first replica, then a read is performed and the result of that is replicated to the other nodes. With RF=1, since there is only one replica, no read is involved, but in a way it's a degenerate case. So there are two reasons why RF=2 is much slower than RF=1: 1) it involves a read to replicate, and that read takes time. Especially if that read hits the disk, it may even dominate the insertion time. 2) the replication to the first replica and the one to the rest of the replicas are not done in parallel but sequentially. Note that this is only true for the first replica versus the others. In other words, from RF=2 to RF=3 you should not see a significant performance degradation. Note that while there is nothing you can do for 2), you can try to speed up 1) by using the row cache for instance (in case you weren't). In other words, with counters, it is expected that RF=1 be potentially much faster than RF>1. That is the way counters work.
And don't get me wrong, I'm not suggesting you should use RF=1 at all. What I am saying is that the performance you see with RF=2 is the performance of counters in Cassandra. -- Sylvain On Wed, Nov 28, 2012 at 7:34 AM, Sergey Olefir solf.lists@... wrote: I think there might be a misunderstanding as to the nature of the problem. Say I have test set T, and two identical servers A and B. - I tested that server A (singly) is able to handle the load of T. - I tested that server B (singly) is able to handle the load of T. - I then join A and B in a cluster and set replication=2 -- because there are two servers and replication=2, each server effectively has to handle all the data written to the cluster. Under these circumstances it is reasonable to assume that the cluster A+B shall be able to handle load T, because each server is able to do so individually. HOWEVER, this is not the case. In fact, A+B together are only able to handle less than 1/3 of T, DESPITE the fact that A and B individually are able to handle T just fine. I think there's something wrong with Cassandra replication (possibly as simple as me misconfiguring something) -- it shouldn't be three times faster to write to two separate nodes in parallel as compared to writing to a 2-node Cassandra cluster with replication=2.
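Sylvain's description of the counter write path (leader applies the write, performs a local read, and only then replicates to the remaining replicas, which proceed in parallel with each other) suggests a rough latency model. This is a toy sketch, not Cassandra's actual code path, and treating the replicate step as costing one more write is an assumption:

```python
def counter_update_latency(write_ms, read_ms, rf):
    """Toy latency model for one counter update:
    leader write, then a local read, then (sequentially after the
    read) replication to the rf-1 other replicas, modeled here as
    one more write since those replicas apply it in parallel."""
    if rf == 1:
        return write_ms                    # degenerate case: no read, no replication
    return write_ms + read_ms + write_ms   # leader write + read + parallel replicate
```

The model makes Sylvain's two points concrete: the read can dominate the update (especially if it hits disk), and going from RF=2 to RF=3 adds little, because only the first-replica-versus-the-rest step is sequential.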
Re: Java high-level client
Astyanax is a Hector fork. You can see many of the Hector authors' comments still in the Astyanax code. There is some nice stuff in there, but (IMHO) I do not see the fork as necessary. It has split up the community a bit, as there are now 3 high-level Java clients. I would advise following Josh's advice http://www.youtube.com/watch?v=nPG4sK_glls . Go to reddit and select whatever sexy technology is new and trending :) On Wed, Nov 28, 2012 at 2:51 PM, Michael Kjellman mkjell...@barracuda.com wrote: Lots of example code, a nice API, and good performance are the first things that come to mind for why I like Astyanax better than Hector. From: Andrey Ilinykh ailin...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, November 28, 2012 11:49 AM To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com Subject: Re: Java high-level client First of all, it is backed by Netflix. They have used it in production for a long time, so it is pretty solid. Also they have a nice tool (Priam) which makes Cassandra cloud (AWS) friendly. This is important for us. Andrey On Wed, Nov 28, 2012 at 11:53 AM, Wei Zhu wz1...@yahoo.com wrote: We are using Hector now. What is the major advantage of Astyanax over Hector? Thanks. -Wei -- *From:* Andrey Ilinykh ailin...@gmail.com *To:* user@cassandra.apache.org *Sent:* Wednesday, November 28, 2012 9:37 AM *Subject:* Re: Java high-level client +1 On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com wrote: Netflix has a great client https://github.com/Netflix/astyanax
Re: Rename cluster
Since the cluster name is only cosmetic, people do not often change it. I would not do this on a production cluster for sure. On Thu, Nov 29, 2012 at 2:56 PM, Wei Zhu wz1...@yahoo.com wrote: Hi, I am trying to rename a cluster by following the instructions on the wiki: Cassandra says ClusterName mismatch: oldClusterName != newClusterName and refuses to start. To prevent operator errors, Cassandra stores the name of the cluster in its system table. If you need to rename a cluster for some reason, you can: Perform these steps on each node: 1. Start the cassandra-cli connected locally to this node. 2. Run the following: 1. use system; 2. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('new cluster name'); 3. exit; 3. Run nodetool flush on this node. 4. Update the cassandra.yaml file for the cluster_name to be the same as 2b). 5. Restart the node. Once all nodes have had this operation performed and restarted, nodetool ring should show all nodes as UP. Get the following error: Connected to: Test Cluster on 10.200.128.151/9160 Welcome to Cassandra CLI version 1.1.6 Type 'help;' or '?' for help. Type 'quit;' or 'exit;' to quit. [default@unknown] use system; Authenticated to keyspace: system [default@system] set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('General Services Cluster'); system keyspace is not user-modifiable. InvalidRequestException(why:system keyspace is not user-modifiable.)
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15974) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:797) at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:781) at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:909) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:222) at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219) at org.apache.cassandra.cli.CliMain.main(CliMain.java:346) I have to remove the data directory in order to change the cluster name. Luckily it's my testing box, so no harm. Just wondering what has been changed not to allow the modification through cli? What is the way of changing the cluster name without wiping out all the data now? Thanks. -Wei
Re: Row caching + Wide row column family == almost crashed?
Row cache has to store the entire row. It is a very bad option for wide rows. On Sunday, December 2, 2012, Mike mthero...@yahoo.com wrote: Hello, We recently hit an issue within our Cassandra-based application. We have a relatively new column family with some very wide rows (tens of thousands of columns, or more in some cases). During a periodic activity, we scan ranges of columns to retrieve various pieces of information, a segment at a time. We run these same queries frequently at various stages of the process, and I thought the application could see a performance benefit from row caching. We have a small row cache (100MB per node) already enabled, and I enabled row caching on the new column family. The results were very negative. When performing range queries with a limit of 200 results, for a small minority of the rows in the new column family, performance plummeted. CPU utilization on the Cassandra node went through the roof, and it started chewing up memory. Some queries to this column family hung completely. According to the logs, we started getting frequent GCInspector messages. Cassandra started flushing the largest memtables due to hitting the flush_largest_memtables_at threshold of 75%, and scaling back the key/row caches. However, to Cassandra's credit, it did not die with an OutOfMemory error. Its emergency measures to conserve memory worked, and the cluster stayed up and running. No real errors showed in the logs, except for messages getting dropped, which I believe was caused by what was going on with CPU and memory. Disabling row caching on this new column family has resolved the issue for now, but is there something fundamental about row caching that I am missing? We are running Cassandra 1.1.2 with a 6-node cluster, with a replication factor of 3. Thanks, -Mike
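A back-of-envelope calculation shows why whole-row caching and wide rows interact badly; the column counts and sizes below are illustrative assumptions, not numbers from Mike's cluster:

```python
def row_cache_rows(cache_bytes, columns_per_row, bytes_per_column):
    """The row cache holds entire rows, so the number of rows that
    fit is the cache size divided by the (whole) row size."""
    row_bytes = columns_per_row * bytes_per_column
    return cache_bytes // row_bytes

# 100 MiB cache, wide rows of 50,000 columns at ~100 bytes each:
# each row is ~5 MB, so only ~20 rows fit -- and every partial
# read of such a row still pulls the entire row into the cache.
```

With rows that large, a handful of wide rows evicts everything else, and materializing each one on a cache miss burns CPU and heap, which matches the GC pressure and emergency flushing Mike describes.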
Re: What is substituting keys_cached column family argument
Rob, have you played with this? I have many CFs, some big, some small, some using large caches, some using small ones, some that take many requests, some that take a few. Over time I have cooked up a strategy for how to share the cache love; even though it may not be the best solution to the problem, I feel it makes sense. I cannot figure out how I am going to be happy with global caches whose size I do not control. What is your take on this? Edward On Wed, Dec 5, 2012 at 2:05 PM, Rob Coli rc...@palominodb.com wrote: On Wed, Dec 5, 2012 at 9:06 AM, Roman Yankin ro...@cognitivematch.com wrote: In Cassandra v0.7 there was a column family property called keys_cached; now it's gone and I'm struggling to understand which of the below properties has substituted it (if substituted at all)? Key and row caches are global in modern Cassandra. You opt CFs out of the key cache, not opt in, because the default setting is keys_only on a per-CF basis. http://www.datastax.com/docs/1.1/configuration/node_configuration#row-cache-keys-to-save http://www.datastax.com/docs/1.1/configuration/node_configuration#key-cache-keys-to-save http://www.datastax.com/docs/1.1/configuration/storage_configuration#caching =Rob -- =Robert Coli AIMGTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.
http://wiki.apache.org/cassandra/LargeDataSetConsiderations On Thu, Dec 6, 2012 at 9:53 AM, Poziombka, Wade L wade.l.poziom...@intel.com wrote: “Having so much data on each node is a potential bad day.” Is this discussed somewhere in the Cassandra documentation (limits, practices etc)? We are also trying to load up quite a lot of data and have hit memory issues (bloom filter etc.) in 1.0.10. I would like to read up on big-data usage of Cassandra, meaning terabyte-size databases. I do get your point about the amount of time required to recover a downed node. But this 300-400GB business is interesting to me. Thanks in advance. Wade *From:* aaron morton [mailto:aa...@thelastpickle.com] *Sent:* Wednesday, December 05, 2012 9:23 PM *To:* user@cassandra.apache.org *Subject:* Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction. Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected! I would recommend having up to 300GB to 400GB per node on a regular HDD with 1GB networking. But on the 3rd node, we suspect major compaction didn't actually finish its job… The file list looks odd. Check the timestamps on the files. You should not have files older than when compaction started. 8GB heap The default is 4GB max nowadays. 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? I cannot answer that. 2) Should we restart with leveled compaction next year? I would run some tests to see how it works for your workload. 4) Should we consider increasing the cluster capacity? IMHO yes. You may also want to do some experiments with turning compression on if it is not already enabled. Having so much data on each node is a potential bad day.
If instead you had to move or repair one of those nodes, how long would it take for Cassandra to stream all the data over? (Or to rsync the data over.) How long does it take to run nodetool repair on the node? With RF=3, if you lose a node you have lost your redundancy. It's important to have a plan for how to get it back and how long it may take. Hope that helps. - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 6/12/2012, at 3:40 AM, Alexandru Sicoe adsi...@gmail.com wrote: Hi guys, Sorry for the late follow-up, but I waited to run major compactions on all 3 nodes at a time before replying with my findings. Basically we were successful on two of the nodes. They both took ~2 days and 11 hours to complete and at the end we saw one very large file ~900GB and the rest much smaller (the overall size decreased). This is what we expected! But on the 3rd node, we suspect major compaction didn't actually finish its job. First of all, nodetool compact returned much earlier than the rest - after one day and 15 hrs. Secondly, from the 1.4TB initially on the node only about 36GB were freed up (almost the same size as before). Saw nothing in the server log (debug not enabled). Below I pasted some more details about file sizes before and after compaction on this third node and disk occupancy. The situation is maybe not so dramatic for us because in less than 2 weeks we will have downtime till after the new year. During this we can completely delete all the data in the cluster and start fresh with TTLs for 1 month (as suggested by Aaron, and an 8GB heap as suggested by Alain - thanks). Questions: 1) Do you expect problems with the 3rd node during 2 weeks more of operations, in the conditions seen below? [Note: we expect the minor compactions to continue building up files but never really getting to compacting the large file and thus not needing much temporary extra disk space].
2) Should we restart with leveled compaction next year? [Note: Aaron was right, we have 1-week rows which get deleted after 1 month, which means older rows end up in big files, so to free up space with SizeTiered we will have no choice but to run major compactions, which we don't know will work given that we take on ~1TB / node / month. You can see we are at the limit!] 3) In case we keep SizeTiered: - How can we improve the performance of our major compactions? (we left all config parameters at default). Would increasing compaction throughput interfere with writes and reads? What about multi-threaded compaction? - Do we still need to run regular repair operations as well? Do these also do a major compaction
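Aaron's question about how long it would take to stream a node's data over can be estimated roughly. This is a back-of-envelope sketch; the 50% link-efficiency figure is an assumption, not a measurement, and it ignores compaction and repair overhead:

```python
def stream_hours(data_gb, link_gbps=1.0, efficiency=0.5):
    """Rough time to move `data_gb` gigabytes over a network link of
    `link_gbps` gigabits/s that achieves `efficiency` of its nominal
    rate. Pure transfer time only."""
    seconds = (data_gb * 8) / (link_gbps * efficiency)
    return seconds / 3600

# ~900 GB over 1 Gb/s at 50% efficiency -> ~4 hours of raw transfer,
# which is why 1TB+ per node makes node replacement a long operation.
```

The point of the exercise: the per-node data cap Aaron recommends is driven less by steady-state performance than by how long recovery of a dead or moving node takes.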
Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?
Good point. Hadoop sprays its blocks around randomly, thus if 'replication factor' many nodes are down, some blocks are not found. The larger the cluster, the higher the chance nodes are down. To deal with this, increase RF once the cluster gets to be very large. On Wednesday, December 5, 2012, Eric Parusel ericparu...@gmail.com wrote: Hi all, I've been wondering about virtual nodes and how cluster uptime might change as cluster size increases. I understand clusters will benefit from increased reliability due to faster rebuild time, but does that hold true for large clusters? It seems that since (and correct me if I'm wrong here) every physical node will likely share some small amount of data with every other node, that as the count of physical nodes in a Cassandra cluster increases (let's say into the triple digits) the probability of at least one failure to Quorum read/write occurring in a given time period would *increase*. Would this hold true, at least until the physical node count becomes greater than num_tokens per node? I understand that the window of failure for affected ranges would probably be small, but we do Quorum reads of many keys, so we'd likely hit every virtual range with our queries, even if num_tokens was 256. Thanks, Eric
Re: Virtual Nodes, lots of physical nodes and potentially increasing outage count?
Assuming you need to work with quorum in a non-vnode scenario: if 2 nodes in a row in the ring are down, some number of quorum operations will fail with UnavailableException (TimeoutException right after the failures). This is because for a given range of tokens quorum will be impossible, while quorum will still be possible for others. In a vnode world, if any two nodes are down, then the intersection of vnode token ranges they share is unavailable. I think it is two sides of the same coin. On Mon, Dec 10, 2012 at 7:41 AM, Richard Low r...@acunu.com wrote: Hi Tyler, You're right, the math does assume independence, which is unlikely to be accurate. But if you do have correlated failure modes e.g. same power, racks, DC, etc. then you can still use Cassandra's rack-aware or DC-aware features to ensure replicas are spread around so your cluster can survive the correlated failure mode. So I would expect vnodes to improve uptime in all scenarios, but haven't done the math to prove it. Richard.
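The failure-probability intuition in this thread can be sketched numerically. This is my own back-of-the-envelope toy model, not from the thread: it assumes independent node failures and that with vnodes any two simultaneously-down nodes share some token range, so quorum at RF=3 fails somewhere whenever 2 or more nodes are down at once:

```python
def p_some_quorum_range_unavailable(n_nodes, p_down):
    """Probability that, at some instant, at least 2 nodes are down.
    Under the vnode assumption above, that is roughly the probability
    that some token range cannot reach quorum at RF=3."""
    p_zero_down = (1 - p_down) ** n_nodes
    p_one_down = n_nodes * p_down * (1 - p_down) ** (n_nodes - 1)
    return 1 - (p_zero_down + p_one_down)

# Same per-node reliability, ten times the nodes: the chance that *some*
# quorum range is unavailable grows roughly a hundredfold.
small = p_some_quorum_range_unavailable(10, 0.001)
large = p_some_quorum_range_unavailable(100, 0.001)
assert large > 50 * small
```

This is the sense in which Eric's worry holds: each affected range's outage window is small, but the chance that at least one range is affected grows quickly with node count.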
Re: Why Secondary indexes is so slowly by my test?
Until the change making secondary indexes not do a read-before-write is in a release and stabilized, you should follow Ed Anuff's blog and do your indexing yourself with composites. On Thursday, December 13, 2012, aaron morton aa...@thelastpickle.com wrote: The IndexClause for get_indexed_slices takes a start key. You can page the results from your secondary index query by making multiple calls with a sane count and including a start key. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 13/12/2012, at 6:34 PM, Chengying Fang cyf...@ngnsoft.com wrote: You are right, Dean. It's due to the heavy result returned by the query, not the index itself. According to my test, if the result is less than 5000 rows, it's very quick. But how to limit the result? It seems a row limit is a good choice. But if I do so, some rows I wanted may be missed because the row order does not fulfill the query conditions. For example: CF User{I1,C1} with Index I1. Query conditions: I1=foo, order by C1. If I1=foo return 1 limit 100, I can't get the right result of C1. Also we can not always set the row range to fulfill the query conditions when doing a query. Maybe I should redesign the CF model to fix it. -- Original -- From: Hiller, Dean dean.hil...@nrel.gov; Date: Wed, Dec 12, 2012 10:51 PM To: user@cassandra.apache.org; Subject: Re: Why Secondary indexes is so slowly by my test? You could always try PlayOrm's query capability on top of cassandra ;)….it works for us. Dean From: Chengying Fang cyf...@ngnsoft.com Reply-To: user@cassandra.apache.org Date: Tuesday, December 11, 2012 8:22 PM To: user user@cassandra.apache.org Subject: Re: Why Secondary indexes is so slowly by my test? Thanks to Low. We use CompositeColumn to substitute it in single not-equality and definite equality queries.
And we will give up cassandra because of the weak query ability and instability. Many times we found our data in confusion without a definite cause in our cluster. For example, only two rows in one CF, row1-columnname1-columnvalue1, row2-columnname2-columnvalue2, but sometimes it becomes row1-columnname1-columnvalue2, row2-columnname2-columnvalue1. Notice the wrong column values. -- Original -- From: Richard Low r...@acunu.com; Date: Tue, Dec 11, 2012 07:44 PM To: user user@cassandra.apache.org; Subject: Re: Why Secondary indexes is so slowly by my test? Hi, Secondary index lookups are more complicated than normal queries so will be slower. Items have to first be queried in the index, then retrieved from their actual location. Also, inserting into indexed CFs will be slower (but will get substantially faster in 1.2 due
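Aaron's paging advice above (repeat the query with a sane count, using the last key seen as the new start key) can be sketched generically. This is a hypothetical client-side loop — `fetch_page` is a stand-in for whatever call your client makes (e.g. get_indexed_slices with a start key set on the IndexClause); it is not real Thrift API code:

```python
def paged_query(fetch_page, page_size=1000):
    """Key-based paging: keep asking for `page_size` rows starting at the
    last key seen. `fetch_page(start_key, count)` is a stand-in for a call
    like get_indexed_slices with a start key on the IndexClause."""
    start_key = ''
    while True:
        rows = fetch_page(start_key, page_size)
        # Every page after the first repeats the start key as its first row.
        new_rows = rows if not start_key else rows[1:]
        for row in new_rows:
            yield row
        if len(rows) < page_size:
            break  # short page: no more results
        start_key = rows[-1][0]

# Toy backend over a sorted list of (key, value) pairs, standing in for the index.
data = [(f"k{i:03d}", i) for i in range(10)]
def fake_fetch(start, count):
    return [r for r in data if r[0] >= start][:count]

result = list(paged_query(fake_fetch, page_size=4))
assert [k for k, _ in result] == [f"k{i:03d}" for i in range(10)]
```

Paging this way also bounds the result size per call, which addresses Chengying's "heavy result" problem without losing rows.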
Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra
It should be good stuff. Brian eats this stuff for lunch. On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote: FWIW -- I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series: http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html I hope to make CQL part of the presentation and show how it integrates with the Java APIs. If you are interested, drop in. -brian -- Brian ONeill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile:215.588.6024 blog: http://brianoneill.blogspot.com/ twitter: @boneill42
Re: Help on MMap of SSTables
This issue has to be looked at from a micro and a macro level. On the micro level the best approach is workload specific. On the macro level this mostly boils down to data and memory size. Compactions are going to churn the cache; this is unavoidable. Imho solid state makes the micro optimization meaningless in the big picture. Not that we should not consider tweaking flags, but just saying it is hard to believe anything like that is a game changer. On Monday, December 10, 2012, Rob Coli rc...@palominodb.com wrote: On Thu, Dec 6, 2012 at 7:36 PM, aaron morton aa...@thelastpickle.com wrote: So for memory mapped files, compaction can do a madvise SEQUENTIAL instead of the current DONTNEED flag after detecting appropriate OS versions. Will this help? AFAIK Compaction does use memory mapped file access. The history: https://issues.apache.org/jira/browse/CASSANDRA-1470 =Rob -- =Robert Coli AIM&GTALK - rc...@palominodb.com YAHOO - rcoli.palominob SKYPE - rcoli_palominodb
Re: Why Secondary indexes is so slowly by my test?
Here is a good start: http://www.anuff.com/2011/02/indexing-in-cassandra.html On Thu, Dec 13, 2012 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Edward, can you share the link to this blog ? Alain 2012/12/13 Edward Capriolo edlinuxg...@gmail.com Ed Anuff's
Re: Read operations resulting in a write?
Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. Also it is confusing if these writes show up in JMX and look like user generated write operations. On Mon, Dec 17, 2012 at 10:01 AM, Mike mthero...@yahoo.com wrote: Thank you Aaron, this was very helpful. Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? In my simple example below, the same row is updated and multiple selects of the same row will result in multiple writes to the memtable. It seems it may be possible (although unlikely) that, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes. I can foresee this especially if you are reading the same rows repeatedly. I also noticed from the codepaths that if row caching is enabled, this optimization will not occur. We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache. Our initial results are that it seems to have corrected the write counts, and it has increased performance quite a bit. However, are there any hidden gotchas there because this optimization is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a compaction is behind problem. Any history on that? I couldn't find too much information on it. Thanks, -Mike On 12/16/2012 8:41 PM, aaron morton wrote: 1) Am I reading things correctly? Yes. If you do a read/slice by name and more than min compaction threshold sstables were read, the data is re-written so that the next read uses fewer SSTables. 2) What is really happening here? Essentially minor compactions can occur between 4 and 32 memtable flushes.
Looking through the code, this seems to only affect a couple of types of select statements (selecting a specific column on a specific key being one of them). During the time between these two values, every select statement will perform a write. Yup, only for reading a row where the column names are specified. Remember minor compaction when using SizeTiered Compaction (the default) works on buckets of the same size. Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is in 3 SSTables in the 2nd tier and 2 sstables in the 1st, so it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st tier, it will not be hoisted. There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket, your reads will get handled by that one bucket. It's not every select, just those that touched more than min compaction threshold sstables. 3) Is this desired behavior? Is there something else I should be looking at that could be causing this behavior? Yes. https://issues.apache.org/jira/browse/CASSANDRA-2503 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 15/12/2012, at 12:58 PM, Michael Theroux mthero...@yahoo.com wrote: Hello, We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment. I also think I see where this issue is occurring in the code. We have a specific column family that is under heavy read and write load on a nightly basis. For the purposes of this description, I'll refer to this column family as Bob. During this nightly processing, sometimes Bob is under very heavy write load, other times it is under very heavy read load.
The application is such that when something is written to Bob, a write is made to one of two other tables. We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3-10. This is based on the WriteCount available on the column family JMX MBean. We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect. I brought up a test node to experiment, and see a situation where, when a select statement is executed, a write will occur. In my test, I perform the following (switching between nodetool and cqlsh): update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex key'; nodetool flush update bob set 'about'='coworker' where key='hex
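The hoisting behavior Aaron describes — a by-name read that touches more than the min compaction threshold's worth of sstables writes the merged row back, so the read shows up as a write in JMX — can be illustrated with a toy model. This is my own sketch; the names and mechanics are simplified stand-ins, not Cassandra internals:

```python
MIN_COMPACTION_THRESHOLD = 4  # default min sstables for a minor compaction

class ToyStore:
    """Crude model: a memtable dict plus a list of flushed sstable dicts."""
    def __init__(self):
        self.memtable = {}
        self.sstables = []

    def write(self, key, cols):
        self.memtable.setdefault(key, {}).update(cols)

    def flush(self):
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read_by_name(self, key):
        touched = [t for t in self.sstables if key in t]
        merged = {}
        for t in touched:
            merged.update(t[key])
        merged.update(self.memtable.get(key, {}))
        if len(touched) > MIN_COMPACTION_THRESHOLD:
            # Hoist: the read itself performs a write of the merged row,
            # so later reads can be served from fewer places.
            self.write(key, merged)
        return merged, len(touched)

store = ToyStore()
for i in range(5):                 # fragment row 'bob' across 5 sstables
    store.write('bob', {'c%d' % i: i})
    store.flush()

row, touched = store.read_by_name('bob')
assert touched == 5                # the read had to merge 5 sstables...
assert 'bob' in store.memtable     # ...so it counted as a write (hoisted)
```

This matches Michael's observation: the write count on "Bob" climbs under read load because fragmented rows get rewritten on read.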
Re: rpc_timeout exception while inserting
CQL2 and CQL3 indexes are not compatible. I guess CQL2 is able to detect that the table was defined in CQL3 and probably should not allow it. Backwards compatibility is something the storage engines and interfaces have to account for; at least they should prevent you from hurting yourself. But do not try to defeat the system: just stick with one CQL version. On Tue, Dec 18, 2012 at 7:37 AM, Abhijit Chanda abhijit.chan...@gmail.com wrote: I was trying to mix CQL2 and CQL3 to check whether a columnfamily with compound keys can be further indexed, because using CQL3 secondary indexing on a table with a composite PRIMARY KEY is not possible. And surprisingly, by mixing the CQL versions I was able to do so. But when I want to insert anything in the column family it gives me a rpc_timeout exception. I personally found it quite abnormal, so thought of posting this thing in the forum. Best, On Mon, Dec 10, 2012 at 6:29 PM, Sylvain Lebresne sylv...@datastax.com wrote: On Mon, Dec 10, 2012 at 12:36 PM, Abhijit Chanda abhijit.chan...@gmail.com wrote: Hi All, I have a column family whose structure is CREATE TABLE practice ( id text, name text, addr text, pin text, PRIMARY KEY (id, name) ) WITH comment='' AND caching='KEYS_ONLY' AND read_repair_chance=0.10 AND gc_grace_seconds=864000 AND replicate_on_write='true' AND compaction_strategy_class='SizeTieredCompactionStrategy' AND compression_parameters:sstable_compression='SnappyCompressor'; CREATE INDEX idx_address ON practice (addr); Initially I have made the column family using CQL 3.0.0. Then for creating the index I have used CQL 2.0. Now when I want to insert any data in the column family it always shows a timeout exception. INSERT INTO practice (id, name, addr, pin) VALUES ( '1','AB','kolkata','700052'); Request did not complete within rpc_timeout. Please suggest me where I am getting wrong? That would be creating the index through CQL 2. Why did you use CQL 3 for the CF creation and CQL 2 for the index one?
If you do both in CQL 3, that should work as expected. That being said, you should probably not get timeouts (that won't do what you want though). If you look at the server log, do you have an exception there? -- Sylvain -- Abhijit Chanda Analyst VeHere Interactive Pvt. Ltd. +91-974395
Re: Monitoring the number of client connections
In the TCP MIB for SNMP (Simple Network Management Protocol) this information is available: http://www.simpleweb.org/ietf/mibs/mibSynHiLite.php?category=IETF&module=TCP-MIB On Wed, Dec 19, 2012 at 12:22 AM, Michael Kjellman mkjell...@barracuda.com wrote: netstat + cron is your friend at this point in time On Dec 18, 2012, at 8:25 PM, aaron morton aa...@thelastpickle.com wrote: AFAIK the count of connections is not exposed. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 18/12/2012, at 10:37 PM, Tomas Nunez tomas.nu...@groupalia.com wrote: Hi! I want to know how many client connections each one of my cluster nodes has (to check if my load balancing is spreading in a balanced way, to check if an increase in the cluster load can be related to an increase in the number of connections, and things like that). I was thinking about going with netstat, counting ESTABLISHED connections to port 9160, but then I thought maybe there is some way in cassandra to get that information (maybe a counter of connections in JMX?). I've tried installing MX4J and going over all MBeans, but I haven't found any with a promising name; they all seem unrelated to this information. And I can't find anything skimming the manual, so... Can you think of a better way than netstat to get this information? Better yet, is there anything similar to SHOW PROCESSLIST in mysql? Thanks! -- www.groupalia.com http://es.groupalia.com/ Tomàs Núñez IT-Sysprod Tel. + 34 93 159 31 00 Fax. + 34 93 396 18 52 Llull, 95-97, 2º planta, 08005 Barcelona Skype: tomas.nunez.groupalia tomas.nu...@groupalia.com Twitter http://twitter.com/#%21/groupaliaes Facebook https://www.facebook.com/GroupaliaEspana Linkedin http://www.linkedin.com/company/groupalia -- Join Barracuda Networks in the fight against hunger.
To learn how you can help in your community, please visit: http://on.fb.me/UAdL4f
Re: thrift client can't add a column back after it was deleted with cassandra-cli?
The cli uses microsecond precision; your client might be using something else, and inserts with lower timestamps are dropped. On Friday, December 21, 2012, Qiaobing Xie qiaobing@gmail.com wrote: Hi, I am developing a thrift client that inserts and removes columns from a column-family (using batch_mutate calls). Everything seems to be working fine - my thrift client can add/retrieve/delete/add back columns as expected... until I manually deleted a column with cassandra-cli. (I was trying to test an error scenario in which my client would discover a missing column and recreate it in the column-family). After I deleted a column from within cassandra-cli manually, my thrift client detected the column of that name missing when it tried to get it. So it tried to recreate a new column with that name along with a bunch of other columns with a batch_mutate call. The call returned normally and the other columns were added/updated, but the one that I manually deleted from cassandra-cli was not added/created in the column family. I tried to restart my client and cassandra-cli but it didn't help. It just seemed that my thrift client could no longer add a column with that name! Finally I destroyed and recreated the whole column-family and the problem went away. Any idea what I did wrong? -Qiaobing
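Edward's point explains the mystery: cassandra-cli stamps its delete in microseconds, so a client supplying timestamps in a coarser unit (milliseconds or seconds) will always lose to the tombstone, and its re-inserts are silently dropped. A sketch of the arithmetic, assuming the client picked the wrong unit:

```python
import time

def ts_seconds():
    return int(time.time())              # seconds since epoch, ~1e9 range
def ts_millis():
    return int(time.time() * 1_000)      # ~1e12 range
def ts_micros():
    return int(time.time() * 1_000_000)  # ~1e15 range -- what cassandra-cli uses

def resolve(write_ts, tombstone_ts):
    """A cell survives only if its timestamp beats the tombstone's."""
    return 'kept' if write_ts > tombstone_ts else 'dropped'

tombstone = ts_micros()  # delete issued from cassandra-cli

# Inserts stamped in seconds or milliseconds lose to the tombstone forever,
# no matter how much later (in wall-clock time) they actually happen.
assert resolve(ts_seconds(), tombstone) == 'dropped'
assert resolve(ts_millis(), tombstone) == 'dropped'
# Only a microsecond-stamped insert issued after the delete wins.
assert resolve(ts_micros() + 1, tombstone) == 'kept'
```

That is exactly the symptom Qiaobing saw: the column could "never" be re-added until the whole column-family (and its tombstone) was destroyed.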
Re: Correct way to design a cassandra database
You could store the order as the first part of a composite string: say the first picture is A and the second is B. To insert one between them, call it AA. If you shuffle a lot the strings could get really long. It might be better to store the order in a separate column. Neither solution mentioned deals with concurrent access well. On Friday, December 21, 2012, Adam Venturella aventure...@gmail.com wrote: One more link that might be helpful. It's a similar system to photos but instead of Photos/Albums it's Songs/Playlists: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts. It's not exactly 1:1 but it covers related concepts in making it work. On Fri, Dec 21, 2012 at 8:02 AM, Adam Venturella aventure...@gmail.com wrote: Ok.. So here is my latest thinking... Including that index: CREATE TABLE Users ( user_name text, password text, PRIMARY KEY (user_name) ); ^ Same as before CREATE TABLE Photos ( user_name text, photo_id uuid, created_time timestamp, data text, PRIMARY KEY (user_name, photo_id, created_time) ) WITH CLUSTERING ORDER BY (created_time DESC); ^ Note the addition of a photo id and using that in the PK def with the created_time. Data is JSON like this: { thumbnail: url, standard_resolution: url } CREATE TABLE PhotosAlbums ( user_name text, album_name text, poster_image_url text, data text, PRIMARY KEY (user_name, album_name) ); ^ Same as before, data represents a JSON array of the photos: [{photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}, {photo_id:..., thumbnail:url, standard_resolution:url}] CREATE TABLE PhotosAlbumsIndex ( user_name text, photo_id uuid, album_name text, created_time timestamp, PRIMARY KEY (user_name, photo_id, album_name) ); The created_time column here is because you need to have at least 1 column that is not part of the PK. Or that's what it looks like in my quick test.
^ Each photo added to an album needs to be added to this index row. As before, your application will need to keep the order of the array intact as your users modify the order of things. Now however if they delete a photo you need to fetch the PhotoAlbums the photo existed in and update them accordingly: SELECT * FROM PhotosAlbumsIndex WHERE user_name='the_user' AND photo_id=uuid This should return to you all of the albums that the photo was a part of. Now you need to: SELECT * FROM PhotosAlbums where user_name = the_user and album_name IN
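Edward's ordering-string idea from this thread (first photo A, second B, insert between them as AA) generalizes to computing a key that sorts between any two existing keys. A hypothetical sketch of one way to do it, treating keys as base-26 fractions; it assumes a key strictly between lo and hi exists, which holds as long as keys never end in 'A' (and the keys this function generates never do):

```python
ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def between(lo, hi):
    """Return a key sorting strictly between lo and hi (requires lo < hi).
    Keys grow by at most one character per call, which is the "strings
    could get really long" drawback mentioned above."""
    out = []
    i = 0
    while True:
        a = ALPHABET.index(lo[i]) if i < len(lo) else 0   # digit 0 past the end
        b = ALPHABET.index(hi[i]) if i < len(hi) else 26  # one past 'Z'
        if b - a > 1:
            out.append(ALPHABET[(a + b) // 2])  # room for a middle digit
            return ''.join(out)
        out.append(ALPHABET[a])  # copy the shared digit and go one level deeper
        i += 1

first, second = 'A', 'B'
mid = between(first, second)
assert first < mid < second
# Repeated inserts before `mid` keep producing valid, slowly-growing keys.
k = mid
for _ in range(10):
    k = between(first, k)
    assert first < k < mid
```

As Edward notes, a separate order column avoids the key growth, and neither approach handles concurrent reorders well without some coordination.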
Re: State of Cassandra and Java 7
Which versions are supported is kinda up to you; for example, earlier versions of the JDK now have bugs. I have a version of java, 1.6.0_23 I believe, that will not even start with the latest cassandra releases. Likewise people suggest not running the newest ones (1.7.0) because they have not tested them. So there is not a definitive version that is the best. If you're having problems and your version is older, someone will say upgrade; if your newest version is not working, someone will say downgrade. No one trusts a just-released version. Generally this means try to keep a few months behind the curve. As with most things in c* you can run different versions on different nodes; you are not forced into an all-at-once upgrade. On Sun, Dec 23, 2012 at 4:37 AM, Fabrice Facorat fabrice.faco...@gmail.com wrote: At Orange portails we are presently testing Cassandra 1.2.0 beta/rc with Java 7, and presently we have no issues 2012/12/22 Brian Tarbox tar...@cabotresearch.com: What I saw in all cases was a) set JAVA_HOME to java7, run program fail b) set JAVA_HOME to java6, run program success I should have better notes but I'm at a 6 person startup so working tools get used and failing tools get deleted. Brian On Fri, Dec 21, 2012 at 3:54 PM, Bryan Talbot btal...@aeriagames.com wrote: Brian, did any of your issues with java 7 result in corrupting data in cassandra? We just ran into an issue after upgrading a test cluster from Cassandra 1.1.5 and Oracle JDK 1.6.0_29-b11 to Cassandra 1.1.7 and 7u10. What we saw is values in columns with validation Class=org.apache.cassandra.db.marshal.LongType that were proper integers becoming corrupted so that they became stored as strings. I don't have a reproducible test case yet but will work on making one over the holiday if I can.
For example, a column with a long type that was originally written and stored properly (say with value 1200) was somehow changed during cassandra operations (compaction seems the only possibility) to be the value '1200' with quotes. The data was written using the phpcassa library and that application and library haven't been changed. This has only happened on our test cluster which was upgraded and hasn't happened on our live cluster which was not upgraded. Many of our column families were affected and all affected columns are Long (or bigint for cql3). Errors when reading using CQL3 command client look like this: Failed to decode value '1356441225' (for column 'expires') as bigint: unpack requires a string argument of length 8 and when reading with cassandra-cli the error is [default@cf] get token['fbc1e9f7cc2c0c2fa186138ed28e5f691613409c0bcff648c651ab1f79f9600b']; = (column=client_id, value=8ec4c29de726ad4db3f89a44cb07909c04f90932d, timestamp=1355836425784329, ttl=648000) A long is exactly 8 bytes: 10 -Bryan On Mon, Dec 17, 2012 at 7:33 AM, Brian Tarbox tar...@cabotresearch.com wrote: I was using jre-7u9-linux-x64 which was the latest at the time. I'll confess that I did not file any bugs...at the time the advice from both the Cassandra and Zookeeper lists was to stay away from Java 7 (and my boss had had enough of my reporting that the problem was Java 7 for me to spend a lot more time getting the details). Brian On Sun, Dec 16, 2012 at 4:54 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Sat, Dec 15, 2012 at 7:12 PM, Michael Kjellman mkjell...@barracuda.com wrote: What issues have you ran into? Actually curious because we push 1.1.5-7 really hard and have no issues whatsoever. A related question is which which version of java 7 did you try? The first releases of java 7 were apparently famous for having many issues but it seems the more recent updates are much more stable. 
-- Sylvain On Dec 15, 2012, at 7:51 AM, Brian Tarbox tar...@cabotresearch.com wrote: We've reverted all machines back to Java 6 after running into numerous Java 7 issues...some running Cassandra, some running Zookeeper, others just general problems. I don't recall any other major language release being such a mess. On Fri, Dec 14, 2012 at 5:07 PM, Bill de hÓra b...@dehora.net wrote: At least that would be one way of defining officially supported. Not quite, because, Datastax is not Apache Cassandra. the only issue related to Java 7 that I know of is CASSANDRA-4958, but that's osx specific (I wouldn't advise using osx in production anyway) and it's not directly related to Cassandra anyway so you can easily use the beta version of snappy-java as a workaround if you want to. So that non blocking issue aside, and as far as we know, Cassandra supports Java 7. Is it rock-solid in production? Well, only repeated use in production can tell, and that's not really in the hand of the project. Exactly right. If enough people use Cassandra on
Re: how to create a keyspace in CQL3
Unfortunately one of the first commands everyone needs to work with cassandra changes very often. You can use cqlsh: help create_keyspace; But sometimes even the documentation is not in line. Using this permutation of goodness: cqlsh 2.3.0 | Cassandra 1.2.0-beta2-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 19.35.0 the syntax is as follows: cqlsh> create keyspace a with replication = {'class':'SimpleStrategy', 'replication_factor':3}; On Sun, Dec 23, 2012 at 10:15 AM, Manu Zhang owenzhang1...@gmail.com wrote: I'm wondering why the following command to create a keyspace in CQL3 fails. It is the same as the sample in the doc http://cassandra.apache.org/doc/cql3/CQL.html CREATE KEYSPACE demodb WITH strategy_class = SimpleStrategy AND strategy_options:replication_factor = 1; I'm using Cassandra 1.2-beta2
Re: Force data to a specific node
There is a crazy, very bad, don't-do-it way to do this. You can set RF=1 and hack the LocalPartitioner (because the LocalPartitioner has been made not to do this). Then the node you connect to and write to is the node the data will get stored on. It's like memcache-style do-it-yourself sharding. Did I say it's not suggested? If not: not suggested. On Wed, Jan 2, 2013 at 2:54 PM, Aaron Turner synfina...@gmail.com wrote: You'd have to use the ordered partitioner or something like that and choose your row key according to the node you want it placed on. But that's in general a really bad idea because you end up with unbalanced nodes and hot spots. That said, are your nodes on a LAN? I have my 9+3 node cluster (two datacenters) on 100Mbps ports (which everyone says not to do) and it's working just fine. Even node rebuilds haven't been that bad so far. If you're trying to avoid WAN replication, then use a dedicated cluster. On Wed, Jan 2, 2013 at 10:20 AM, Everton Lima peitin.inu...@gmail.com wrote: We need to do this to minimize the network I/O. We have our own load data balance algorithm. We have some data that is best to process on a local machine. Is it possible? How? 2013/1/2 Edward Sargisson edward.sargis...@globalrelay.net Why would you want to? From: Everton Lima peitin.inu...@gmail.com To: Cassandra-User user@cassandra.apache.org Sent: Wed Jan 02 18:03:49 2013 Subject: Force data to a specific node Is it possible to force data to stay on a specific node? -- Everton Lima Aleixo Bacharel em Ciência da Computação pela UFG Mestrando em Ciência da Computação pela UFG Programador no LUPA -- Aaron Turner http://synfin.net/ Twitter: @synfinatic http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin carpe diem quam minimum credula postero
Re: RandomPartitioner to Murmur3Partitioner
By the way, 10% faster does not necessarily mean 10% more requests. https://issues.apache.org/jira/browse/CASSANDRA-2975 https://issues.apache.org/jira/browse/CASSANDRA-3772 Also if you follow the tickets: My tests show that Murmur3Partitioner actually is worse than MD5 with high cardinality indexes; here is what I did (kernel 3.0.0-19, 2.2Ghz quad-core Opteron, 2GB RAM): For each test: wiped all of the data directories and re-compiled with 'clean', ran stress with -c 50 -C 500 -S 512 -n 5 (where -c is number of columns, -C values cardinality and -S is value size in bytes) 4 times (to make it hot). RandomPartitioner: average op rate is 845. Murmur3Partitioner: average op rate is 721. Then later: I have removed the ThreadLocal declaration from the M3P (and cleaned whitespace errors) which was the bottleneck; after re-running tests with that modification M3P beats RP 903 to 847. 847/903 = 0.937984496. I think that is 6 or 7%, right? Not 10%. And other things in cassandra are orders of magnitude slower than computing hashes: network, disk IO. Also, is this test only testing when using 2ndary indexes? What about people who do not care about 2ndary indexes? I am sure it is faster and better, but I am not going to lose sleep until I rebuild all my clusters just to change the partitioner. So for new clusters I will probably use the default, but I'm not going to upgrade existing ones. Let them stay RP. Edward On Thu, Jan 3, 2013 at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hello, I have read the following from the changes.txt file: The default partitioner for new clusters is Murmur3Partitioner, which is about 10% faster for index-intensive workloads. Partitioners cannot be changed once data is in the cluster, however, so if you are switching to the 1.2 cassandra.yaml, you should change this to RandomPartitioner or whatever your old partitioner was. Does this mean that there is absolutely no way to switch to the new partitioner for people that are already using Cassandra? Alain
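For what it's worth, Edward's percentage math above checks out; this is just the quoted stress numbers (847 vs 903 ops from the ticket, after the ThreadLocal fix) spelled out:

```python
# Patched Murmur3Partitioner vs RandomPartitioner stress results quoted above.
rp_ops, m3p_ops = 847, 903

shortfall = 1 - rp_ops / m3p_ops   # how far RP falls short of M3P
speedup = m3p_ops / rp_ops - 1     # how much faster M3P is than RP

assert round(shortfall * 100) == 6   # ~6.2%, matching the "6 or 7%" estimate
assert round(speedup * 100) == 7     # ~6.6% faster, i.e. under the quoted 10%
```

Either way of slicing the ratio lands in the 6-7% range rather than 10%, which is Edward's point.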
Re: Error after 1.2.0 upgrade
Just a shot in the dark, but I would try setting -Xss higher than the default. It's probably like 180, but I can't even start at that level; I bumped it up to 256 for JDK 7. On Thu, Jan 3, 2013 at 12:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: :) yes, I'm crazy. The assertion appears to be compiled code which is why I was guessing JNA. Biggest issue right now is that upgraded 1.2.0 nodes only see other 1.2.0 nodes in the ring. 1.1.7 nodes don't see the 1.2.0 nodes.. Upgrading every node to 1.2.0 now lists all nodes in the ring... On Jan 3, 2013, at 8:57 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Wow, so you're going live with 1.2.0, good luck with that. When it's done, would you mind letting me know if everything went fine or if you have some advice or feedback? This looks related to JNA? Does it? The only thing logged about JNA is the following: JNA mlockall successful. What does this line *** java.lang.instrument ASSERTION FAILED ***: !errorOutstanding with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806 mean? 2013/1/3 Michael Kjellman mkjell...@barracuda.com I'm having huge upgrade issues from 1.1.7 - 1.2.0 atm, but in a 12 node cluster which I am slowly massaging into a good state I haven't seen this in 15+ hours of operation… This looks related to JNA? From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 8:42 AM To: user@cassandra.apache.org Subject: Error after 1.2.0 upgrade In a dev env, C* 1.1.7 - 1.2.0, 1 node. I run Cassandra in an 8GB memory environment.
The upgrade went well, but I sometimes have the following error: INFO 17:31:04,143 Node /192.168.100.201 state jump to normal INFO 17:31:04,149 Enqueuing flush of Memtable-local@1654799672(32/32 serialized/live bytes, 2 ops) INFO 17:31:04,149 Writing Memtable-local@1654799672(32/32 serialized/live bytes, 2 ops) INFO 17:31:04,371 Completed flushing /home/stockage/cassandra/data/system/local/system-local-ia-12-Data.db (91 bytes) for commitlog position ReplayPosition(segmentId=1357230649515, position=49584) INFO 17:31:04,376 Startup completed! Now serving reads. INFO 17:31:04,798 Compacted to [/var/lib/cassandra/data/system/local/system-local-ia-13-Data.db,]. 950 to 471 (~49% of original) bytes for 1 keys at 0,000507MB/s. Time: 886ms. INFO 17:31:04,889 mx4j successfuly loaded HttpAdaptor version 3.0.2 started on port 8081 INFO 17:31:04,967 Not starting native transport as requested. Use JMX (StorageService-startNativeTransport()) to start it INFO 17:31:04,980 Binding thrift service to /0.0.0.0:9160 INFO 17:31:05,007 Using TFramedTransport with a max frame size of 15728640 bytes. INFO 17:31:09,964 Using synchronous/threadpool thrift server on 0.0.0.0 : 9160 INFO 17:31:09,965 Listening for thrift clients... 
*** java.lang.instrument ASSERTION FAILED ***: !errorOutstanding with message transform method call failed at ../../../src/share/instrument/JPLISAgent.c line: 806 ERROR 17:33:56,002 Exception in thread Thread[Thrift:1702,5,main] java.lang.StackOverflowError at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(Unknown Source) at java.io.BufferedInputStream.fill(Unknown Source) at java.io.BufferedInputStream.read1(Unknown Source) at java.io.BufferedInputStream.read(Unknown Source) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:22) at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:199) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source)
Re: Error after 1.2.0 upgrade
The only true drain is: 1) turn on iptables to stop all incoming traffic 2) flush 3) wait 4) delete files 5) upgrade 6) restart. On Thu, Jan 3, 2013 at 2:59 PM, Michael Kjellman mkjell...@barracuda.com wrote: That's why I didn't create a ticket, as I knew there was one. But I thought this had been fixed in 1.1.7? From: Edward Capriolo edlinuxg...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:57 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade There is a bug on this; drain has been in a weird state for a long time. In 1.0 it did not work and was labeled as a known limitation. https://issues.apache.org/jira/browse/CASSANDRA-4446 On Thu, Jan 3, 2013 at 2:49 PM, Michael Kjellman mkjell...@barracuda.com wrote: Another thing: for those that use counters this might be a problem. I always do a nodetool drain before upgrading a node (as is good practice, btw). However, in every case on every one of my nodes, the commit log was replayed on each node and mutations were created. Could lead to double counting of counters… No bug for that yet. Best, Michael From: Michael Kjellman mkjell...@barracuda.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:42 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade Tracking issues: https://issues.apache.org/jira/browse/CASSANDRA-5101 https://issues.apache.org/jira/browse/CASSANDRA-5104 (which was created because of https://issues.apache.org/jira/browse/CASSANDRA-5103) https://issues.apache.org/jira/browse/CASSANDRA-5102 Also a friendly reminder to all that CQL2-created indexes will not work with CQL3. You need to drop them and recreate them in CQL3; otherwise you'll see rpc_timeout issues. I'll update with more issues as I see them.
The fun bugs never happen in your dev environment, do they :) From: aaron morton aa...@thelastpickle.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 11:38 AM To: user@cassandra.apache.org Subject: Re: Error after 1.2.0 upgrade Michael, could you share some of your problems? May be of help for others. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/01/2013, at 5:45 AM, Michael Kjellman mkjell...@barracuda.com wrote: I'm having huge upgrade issues from 1.1.7 - 1.2.0 atm, but in a 12 node cluster which I am slowly massaging into a good state I haven't seen this in 15+ hours of operation… This looks related to JNA? From: Alain RODRIGUEZ arodr...@gmail.com Reply-To: user@cassandra.apache.org Date: Thursday, January 3, 2013 8:42 AM To: user@cassandra.apache.org Subject: Error after 1.2.0 upgrade In a dev env, C* 1.1.7 - 1.2.0, 1 node. I run Cassandra in an 8GB memory environment. The upgrade went well, but I sometimes have the error quoted above.
Re: Specifying initial token in 1.2 fails
Yes. They were really just introduced, and if you are ready to hitch your wagon to every new feature you put yourself at considerable risk. That is true with any piece of software, not just Cassandra. On Fri, Jan 4, 2013 at 11:59 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: But I don't really get the point of starting a new cluster without vnodes... Is there some disadvantage to using vnodes? Alain 2013/1/4 Nick Bailey n...@datastax.com If you are planning on using murmur3 without vnodes (specifying your own tokens) there is a quick python script in the datastax docs you can use to generate balanced tokens. http://www.datastax.com/docs/1.2/initialize/token_generation#calculating-tokens-for-the-murmur3partitioner On Fri, Jan 4, 2013 at 10:53 AM, Michael Kjellman mkjell...@barracuda.com wrote: To be honest I haven't run a cluster with Murmur3. You can still use indexing with RandomPartitioner (all us old folk are stuck on Random, btw). And there was a thread floating around yesterday where Edward did some benchmarks and found that Murmur3 was actually slower than RandomPartitioner. http://www.mail-archive.com/user@cassandra.apache.org/msg26789.html http://permalink.gmane.org/gmane.comp.db.cassandra.user/30182 I do know that with vnodes token allocation is now 100% dynamic, so no need to manually assign tokens to nodes anymore. Best, Michael From: Dwight Smith dwight.sm...@genesyslab.com Reply-To: user@cassandra.apache.org Date: Friday, January 4, 2013 8:48 AM To: 'user@cassandra.apache.org' Subject: RE: Specifying initial token in 1.2 fails Michael Yes indeed – my mistake. Thanks. I can specify RandomPartitioner, since I do not use indexing – yet. Just for informational purposes – with Murmur3 – to achieve a balanced cluster – is the initial token method supported? If so, how should these be generated? The token-generator seems to only apply to RandomPartitioner.
Thanks again From: Michael Kjellman [mailto:mkjell...@barracuda.com] Sent: Friday, January 04, 2013 8:39 AM To: user@cassandra.apache.org Subject: Re: Specifying initial token in 1.2 fails Murmur3 != MD5 (RandomPartitioner) From: Dwight Smith dwight.sm...@genesyslab.com Reply-To: user@cassandra.apache.org Date: Friday, January 4, 2013 8:36 AM To: 'user@cassandra.apache.org' Subject: Specifying initial token in 1.2 fails Hi Just started evaluating 1.2 – starting a clean Cassandra node – the usual practice is to specify the initial token – but when I attempt to start the node the following is observed:

INFO [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 203) disk_failure_policy is stop
DEBUG [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 205) page_cache_hinting is false
INFO [main] 2013-01-03 14:08:57,774 DatabaseDescriptor.java (line 266) Global memtable threshold is enabled at 339MB
DEBUG [main] 2013-01-03 14:08:58,008 DatabaseDescriptor.java (line 381) setting auto_bootstrap to true
ERROR [main] 2013-01-03 14:08:58,024 DatabaseDescriptor.java (line 495) Fatal configuration error
org.apache.cassandra.exceptions.ConfigurationException: For input string: 85070591730234615865843651857942052863
    at org.apache.cassandra.dht.Murmur3Partitioner$1.validate(Murmur3Partitioner.java:180)
    at org.apache.cassandra.config.DatabaseDescriptor.loadYaml(DatabaseDescriptor.java:433)
    at org.apache.cassandra.config.DatabaseDescriptor.clinit(DatabaseDescriptor.java:121)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:178)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:397)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:440)

This looks like a bug. Thanks
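Nick's pointer to the token-generation script boils down to simple arithmetic. A minimal sketch of the formulas (this is not the DataStax script itself, just the math behind it):

```python
# Murmur3Partitioner tokens live in [-2**63, 2**63 - 1], so evenly spaced
# initial tokens for a cluster of node_count nodes are:
def murmur3_tokens(node_count):
    return [(2**64 // node_count) * i - 2**63 for i in range(node_count)]

# RandomPartitioner (MD5) tokens live in [0, 2**127), hence a different formula:
def random_tokens(node_count):
    return [(2**127 // node_count) * i for i in range(node_count)]

# The failing token from Dwight's log is a RandomPartitioner-range value,
# far outside Murmur3's range, which is why validation rejects it.
bad_token = 85070591730234615865843651857942052863
print(bad_token > 2**63 - 1)  # True
print(murmur3_tokens(3))
```

Pasting a token generated for one partitioner into a cluster configured with the other fails exactly as shown in the log above.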
Re: help tuning compaction..hours of run to get 0% compaction....
There is some point where you simply need more machines. On Mon, Jan 7, 2013 at 5:02 PM, Michael Kjellman mkjell...@barracuda.com wrote: Right, I guess I'm saying that you should try loading your data with leveled compaction and see how your compaction load is. Your workload sounds like leveled will fit much better than size tiered. From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 1:58 PM To: user@cassandra.apache.org Subject: Re: help tuning compaction..hours of run to get 0% compaction The problem I see is that it already takes me more than 24 hours just to load my data... during which time the logs say I'm spending tons of time doing compaction. For example, in the last 72 hours I've consumed *20 hours* per machine on compaction. Can I conclude from that that I should be (perhaps drastically) increasing my compaction_throughput_mb_per_sec on the theory that I'm getting behind? The fact that it takes me 3 days or more to run a test means it's hard to just play with values and see what works best, so I'm trying to understand the behavior in detail. Thanks. Brian On Mon, Jan 7, 2013 at 4:13 PM, Michael Kjellman mkjell...@barracuda.com wrote: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction If you perform at least twice as many reads as you do writes, leveled compaction may actually save you disk I/O, despite consuming more I/O for compaction. This is especially true if your reads are fairly random and don't focus on a single, hot dataset. From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 12:56 PM To: user@cassandra.apache.org Subject: Re: help tuning compaction..hours of run to get 0% compaction I have not specified leveled compaction so I guess I'm defaulting to size tiered?
My data (in the column family causing the trouble) is insert-once, read-many, update-never. Brian On Mon, Jan 7, 2013 at 3:13 PM, Michael Kjellman mkjell...@barracuda.com wrote: Size tiered or leveled compaction? From: Brian Tarbox tar...@cabotresearch.com Reply-To: user@cassandra.apache.org Date: Monday, January 7, 2013 12:03 PM To: user@cassandra.apache.org Subject: help tuning compaction..hours of run to get 0% compaction I have a column family where I'm doing 500 inserts/sec for 12 hours or so at a time. At some point my performance falls off a cliff due to time spent doing compactions. I'm seeing row after row of logs saying that after 1 or 2 hours of compacting it reduced to 100% or 99% of the original. I'm trying to understand what direction this data points me to in terms of configuration change. a) increase my compaction_throughput_mb_per_sec because I'm falling behind (am I falling behind?) b) enable multi-threaded compaction? Any help is appreciated. Brian
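A quick sanity check on the numbers in this thread (the rates and durations are quoted from the messages above; the rest is arithmetic):

```python
# Rough arithmetic on the figures quoted in this thread.
inserts_per_sec = 500
run_hours = 12
rows_per_run = inserts_per_sec * 3600 * run_hours
print(rows_per_run)  # 21,600,000 inserts per 12-hour run

# "20 hours per machine on compaction in the last 72 hours":
compaction_fraction = 20 / 72
print(round(compaction_fraction, 2))  # roughly 28% of wall time spent compacting
```

Spending over a quarter of wall-clock time on compaction during a pure-insert load is one concrete sign of falling behind, which is what makes the throughput and strategy questions above worth asking.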
Re: about validity of recipe A node join using external data copy methods
Basically this recipe is from the old days when we had anti-compaction. Now streaming is very efficient, rarely fails, and there is no need to do it this way anymore. This recipe will be abolished from the second edition. It still likely works, except when using counters. Edward On Tue, Jan 8, 2013 at 7:27 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, Edward Capriolo described in his Cassandra book a faster way [1] to start new nodes if the cluster size doubles, from N to 2*N. It's about splitting each token range into 2 parts, each taken in charge, after the split, by 2 nodes: the existing one and a new one. And for starting a new node, one needs to: - copy the data records from the corresponding node (without the system records) - start the new node with auto_bootstrap: false This raises 2 questions: A) is this recipe still valid with v1.1 and v1.2? B) do we still need to start the new node with auto_bootstrap: false? My guess is yes, as the happening of the bootstrap phase is not recorded into the data records. Thanks. Dominique [1] see recipe A node join using external data copy methods, page 165
Re: about validity of recipe A node join using external data copy methods
It has been true since about 0.8. In the old days ANTI-COMPACTION stunk and many weird errors would cause node joins to have to be retried N times. Now node moves/joins seem to work near 100% of the time (in 1.0.7), and they are also very fast and efficient. If you want to move a node to new hardware you can do it with rsync, but I would not use the technique for growing the cluster. It is error prone, and ends up being more work. On Tue, Jan 8, 2013 at 10:57 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Now streaming is very efficient rarely fails and there is no need to do it this way anymore I guess it's true in v1.2. Is it true also in v1.1? Thanks. Dominique
Re: Wide rows in CQL 3
I ask myself this every day. CQL3 is a new way to do things, including wide rows with collections. There is no upgrade path. You adopt CQL3's sparse tables as soon as you start creating column families from CQL. There is not much backwards compatibility. CQL3 can query compact tables, but you may have to remove the metadata from them so they can be transposed. Thrift can not write into CQL tables easily, because of how the primary keys and column names are encoded into the key column, and compact metadata is not equal to CQL3's metadata. http://www.datastax.com/dev/blog/thrift-to-cql3 For a large swath of problems I like how CQL3 deals with them. For example, you do not really need CQL3 to store a collection in a column family alongside other data. You can use wide rows for this, but the integrated solution with CQL3 metadata is interesting. My biggest beefs are:
1) column names are UTF8 (seems wasteful in most cases)
2) the sparse empty row marker for ghosts (seems like tiny rows with one column have much overhead now)
3) using composites (with compound primary keys in some table designs) is wasteful. Composite adds two unsigned bytes for size and one unsigned byte as 0 per part.
4) many lines of code between user/request and actual disk (tracing a CQL select vs. a slice, young gen, etc.)
5) not sure if collections can be used in REALLY wide row scenarios, aka a 1,000,000 entry set?
I feel that in an effort to be newbie-friendly, sparse+CQL is presented as the best default option. However, the 5 above items are not minor, and in several use cases they could make CQL's sparse tables a bad choice for certain applications. Those users would get better performance from compact storage. I feel that message sometimes gets washed away in all the CQL coolness. What is that you say? This is not actually the most efficient way to store this data? Well who cares, I can do an IN CLAUSE! WooHoo!
On Wed, Jan 9, 2013 at 12:10 PM, Ben Hood 0x6e6...@gmail.com wrote: I'm currently in the process of porting my app from Thrift to CQL3 and it seems to me that the underlying storage layout hasn't really changed fundamentally. The difference appears to be that CQL3 offers a neater abstraction on top of the wide row format. For example, in CQL3, your query results are bound to a specific schema, so you get named columns back - previously you had to process the slices procedurally. The insert path appears to be tighter as well - you don't seem to get away with leaving out key attributes. I'm sure somebody more knowledgeable can explain this better though. Cheers, Ben On Wed, Jan 9, 2013 at 4:51 PM, mrevilgnome mrevilgn...@gmail.com wrote: We use the thrift bindings for our current production cluster, so I haven't been tracking the developments regarding CQL3. I just discovered when speaking to another potential DSE customer that wide rows, or rather columns not defined in the metadata aren't supported in CQL 3. I'm curious to understand the reasoning behind this, whether this is an intentional direction shift away from the big table paradigm, and what's supposed to happen to those of us who have already bought into C* specifically because of the wide row support. What is our upgrade path?
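Edward's point 3 about composite overhead can be made concrete. A sketch of the CompositeType on-disk layout as he describes it (two bytes of length, the component bytes, then a one-byte end-of-component 0; this mirrors his description, not an authoritative spec):

```python
import struct

def encode_composite(*components: bytes) -> bytes:
    """Lay out components composite-style:
    <2-byte big-endian length><value><1-byte end-of-component (0)> per part."""
    out = b""
    for c in components:
        out += struct.pack(">H", len(c)) + c + b"\x00"
    return out

encoded = encode_composite(b"pw")
# 3 bytes of overhead per component on top of the value itself:
print(len(encoded) - len(b"pw"))  # 3
```

For a one-part column name like the terse 'pw' from the discussion above, the wrapper more than doubles the stored name, which is the cost being argued about.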
Re: Wide rows in CQL 3
By no upgrade path I mean to say that if I have a table with compact storage I can not upgrade it to sparse storage. If I have an existing COMPACT table and I want to add a Map to it, I can not. This is what I mean by no upgrade path. Column families that mix static and dynamic columns are pretty common. In fact it is pretty much the default case: you have a default validator, then some columns have specific validators. In the old days people used to say you only need one column family; you would subdivide your row key into parts: username=username, password=password, friend-friene = friends, pet-pets = pets. It's very efficient and very easy if you understand what a slice is. Is everyone else just adding a column family every time they have new data? :) Sounds very un-no-sql-like. Most people are probably going to store column names as tersely as possible. You're not going to store password as a multibyte UTF8(password). You store it as ascii(password) (or really ascii('pw')). Also, for the rest of my comment, I meant that the comparator of any sparse table always seems to be a COMPOSITE even if it is only one part (last I checked). Everything is -COMPOSITE(UTF-8(colname))- at minimum, when in a compact table it is -colname-. My overarching point is that the 5 things I listed do have a cost; the user by default gets sparse storage unless they are smart enough to know they do not want it. This is naturally going to force people away from compact storage. Basically for any column family there are two possible decision paths: 1) use compact 2) use sparse. Other than ease of use, why would I choose sparse? Why should it be the default? On Wed, Jan 9, 2013 at 5:14 PM, Sylvain Lebresne sylv...@datastax.com wrote: c way. Now I can't pretend knowing what every user is doing, but from my experience and what I've seen, this is not such a common thing and CF are either static or dynamic in nature, not both.
Re: Wide rows in CQL 3
Also I have to say I do not get that blank sparse column. Ghost ranges are a little weird but they don't bother me. 1) It's a row of nothing. The definition of a waste. 2) Suppose I have 1 billion rows and my distribution is mostly rows of 1 or 2 columns. My database is now significantly bigger. That stinks. 3) Suppose I write columns frequently. Do I constantly need to keep writing this sparse empty row? It seems like I would. Worst case, each sstable with a write to a rowkey also has this sparse column, meaning multiple blank empty wasteful columns on disk to solve ghosts, which do not bother me anyway. 4) Are these sparse columns also taking memtable space? These questions would give me serious pause to use sparse tables.
Re: Starting Cassandra
I think 1.6.0_24 is too low and 1.7.0 is too high. Try a more recent 1.6. I just had problems with 1.6.0_23; see here: https://issues.apache.org/jira/browse/CASSANDRA-4944 On Thu, Jan 10, 2013 at 10:29 AM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: I have 4 VMs with 1024M memory, 1 CPU. From: Andrea Gazzarini Sent: 10-01-2013, 16:24 To: user@cassandra.apache.org Subject: Re: Starting Cassandra Hi, I'm running Cassandra with 1.6_24 and it's all working, so probably the problem is elsewhere. What about your hardware / OS configuration? On 01/10/2013 04:19 PM, Sloot, Hans-Peter wrote: The java version is 1.6_24. The manual said that 1.7 was not the best choice. But I will try it. From: adeel.ak...@panasiangroup.com Sent: 10-01-2013, 16:08 To: user@cassandra.apache.org; Sloot, Hans-Peter CC: user@cassandra.apache.org Subject: Re: Starting Cassandra Hi, Please check the java version with the (java -version) command and install Java 7 to resolve this issue. Regards, Adeel Akbar Quoting Sloot, Hans-Peter hans-peter.sl...@atos.net: Hello, Can someone help me out? I have installed Cassandra enterprise and followed the cookbook: - Configured the cassandra.yaml file - Configured the cassandra-topology.properties file But when I try to start the cluster with 'service dse start' nothing starts. With cassandra -f I get: /usr/sbin/cassandra -f xss = -ea -javaagent:/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms495M -Xmx495M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss180k Segmentation fault With the command cassandra -v I get: xss = -ea -javaagent:/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms495M -Xmx495M -Xmn100M -XX:+HeapDumpOnOutOfMemoryError -Xss180k 1.1.6-dse-p1 Regards Hans-Peter
Re: trying to use row_cache (b/c we have hot rows) but nodetool info says zero requests
You have to change the column family caching setting from keys_only to rows_only (or all); otherwise the row cache will not be on for this CF. On Wednesday, January 16, 2013, Brian Tarbox tar...@cabotresearch.com wrote: We have quite wide rows and do a lot of concentrated processing on each row... so I thought I'd try the row cache on one node in my cluster to see if I could detect an effect of using it. The problem is that nodetool info says that even with a two gig row_cache we're getting zero requests. Since my client program is actively processing, and since the keycache shows lots of activity, I'm puzzled. Shouldn't any read of a column cause the entire row to be loaded? My entire data file is only 32 gig right now, so it's hard to imagine the 2 gig is too small to hold even a single row? Any suggestions on how to proceed are appreciated. Thanks. Brian Tarbox
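Assuming the 1.1/1.2-era per-column-family caching attribute, the change described above would look something like this from cassandra-cli (the column family name is illustrative):

```
UPDATE COLUMN FAMILY wide_rows WITH caching = 'rows_only';
```

Use 'all' instead of 'rows_only' to keep the key cache enabled as well.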
Re: Starting Cassandra
I think at this point Cassandra startup scripts should reject unsupported JVM versions, since Cassandra won't even start with many JVMs at this point. On Tuesday, January 15, 2013, Michael Kjellman mkjell...@barracuda.com wrote: Do yourself a favor and get a copy of the Oracle 7 JDK (now with more security patches too!) On Jan 15, 2013, at 1:44 AM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: I managed to install apache-cassandra-1.2.0-bin.tar.gz With java-1.6.0-openjdk-1.6.0.0-1.45.1.11.1.el6.x86_64 I still get the segmentation fault. However with java-1.7.0-openjdk-1.7.0.3-2.1.0.1.el6.7.x86_64 everything runs fine. Regards Hans-Peter From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, 15 January 2013 1:20 To: user@cassandra.apache.org Subject: Re: Starting Cassandra DSE includes hadoop files. It looks like the installation is broken. I would start again if possible and/or ask the peeps at DataStax about your particular OS / JVM configuration. In the past I've used this to set a particular JVM when multiple ones are installed… update-alternatives --set java /usr/lib/jvm/java-6-sun/jre/bin/java Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/01/2013, at 10:55 PM, Sloot, Hans-Peter hans-peter.sl...@atos.net wrote: Hi, I removed the open-jdk packages, which caused the dse* packages to be uninstalled too, and installed jdk6u38. But when I installed the dse packages, yum also downloaded and installed the open-jdk packages.
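The version gate suggested above could be sketched like this (the minimum 1.6 update chosen here is a hypothetical threshold for illustration, not an official Cassandra requirement):

```python
import re

def jvm_acceptable(version_string, min_update=32):
    """Return True if a 'java -version'-style string like '1.6.0_24' is 1.7+
    or meets a hypothetical minimum of 1.6.0_<min_update>."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:_(\d+))?", version_string)
    if not m:
        return False
    major, minor, _micro, update = (int(g) if g else 0 for g in m.groups())
    if (major, minor) > (1, 6):
        return True
    if (major, minor) < (1, 6):
        return False
    return update >= min_update

print(jvm_acceptable("1.6.0_24"))  # False, as with the build that segfaulted above
print(jvm_acceptable("1.7.0_3"))   # True
```

A startup script could run this check against the output of `java -version` and refuse to launch rather than segfault.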
Re: Cassandra Consistency problem with NTP
If you have 40ms NTP drift, something is VERY VERY wrong. You should have a local NTP server on the same subnet; do not try to use one on the moon. On Thu, Jan 17, 2013 at 4:42 AM, Sylvain Lebresne sylv...@datastax.com wrote: So what I want is, Cassandra provide some information for client, to indicate A is stored before B, e.g. global unique timestamp, or row order. The row order is determined by 1) the comparator you use for the column family and 2) the column names you, the client, choose for A and B. So what are the column names you use for A and B? Now what you could do is use a TimeUUID comparator for that column family and use a time uuid for A and B column names. In that case, provided A and B are sent from the same client node and B is sent after A on that client (which you said is the case), then any non-buggy time uuid generator will guarantee that the uuid generated for A will be smaller than the one for B, and thus that in Cassandra, A will be sorted before B. In any case, the point I want to make is that Cassandra itself cannot do anything for your problem, because by design the row ordering is something entirely controlled client side (and just so there is no misunderstanding, I make that point not to suggest you were wrong to ask this mailing list, but because we can't suggest a proper solution unless we clearly understand what the problem is). -- Sylvain 2013/1/17 Sylvain Lebresne sylv...@datastax.com I'm not sure I fully understand your problem. You seem to be talking of ordering the requests in the order they are generated. But in that case, you will rely on the ordering of columns within whatever row you store requests A and B in, and that order depends on the column names, which in turn are client provided and don't depend at all on the time synchronization of the cluster nodes. And since you are able to say that request A comes before B, I suppose this means said requests are generated from the same source.
In which case you just need to make sure that the column names storing each request respect the correct ordering. The column timestamps Cassandra uses are there to decide which update *to the same column* is the more recent one. So they only come into play if your requests A and B update the same column and you're interested in knowing which one of the updates will win when you read. But even if that's your case (which doesn't sound like it at all from your description), the column timestamp is only generated server side if you use CQL. And even in that latter case, it's a convenience and you can force a timestamp client side if you really wish. In other words, Cassandra's dependency on time synchronization is not a strong one even in that case. But again, that doesn't seem at all to be the problem you are trying to solve. -- Sylvain On Thu, Jan 17, 2013 at 2:56 AM, Jason Tang ares.t...@gmail.com wrote: Hi I am using Cassandra in a message bus solution; the major responsibility of Cassandra is recording the incoming requests for later consuming. One strategy is First In First Out (FIFO), so I need to get the stored requests in reversed order. I use NTP to synchronize the system time for the nodes in the cluster (4 nodes). But the local times of the nodes still have some inaccuracy, around 40 ms. The consistency level is write ALL and read ONE, and the replication factor is 3. But here is the problem: Request A comes to node One at local time PM 10:00:01.000. Request B comes to node Two at local time PM 10:00:00.980. The correct order is A -> B, but the timestamp order is B -> A. So is there any way for Cassandra to keep the correct order for read operations? (e.g. a logical timestamp?) Or does Cassandra strongly depend on a time synchronization solution? BRs //Tang
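Sylvain's TimeUUID suggestion can be sketched in a few lines. This is a minimal illustration, not a Cassandra client: it only shows that TimeUUIDs generated in sequence on one client sort in generation order when compared by their embedded timestamp, which is what a TimeUUID comparator does.

```python
import uuid

# Name the request columns with client-generated TimeUUIDs. A TimeUUID
# comparator sorts columns by the embedded 60-bit timestamp, so two
# uuids generated in sequence on the same client sort in generation
# order. (CPython's uuid1 even bumps the timestamp when two calls land
# in the same 100ns interval, so ordering is strict within one process.)
col_a = uuid.uuid1()   # column name for request A
col_b = uuid.uuid1()   # column name for request B, generated afterwards

# Emulate the comparator: order by the time component, not raw bytes.
ordered = sorted([col_b, col_a], key=lambda u: u.time)
assert ordered == [col_a, col_b]   # A sorts before B regardless of node clocks
```

The key point of the thread survives in the sketch: the ordering comes from the client-chosen column names, so NTP drift between cluster nodes never enters into it.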
Re: Cassandra Performance Benchmarking.
Wow, you managed to do a load test through the cassandra-cli. There should be a merit badge for that. You should use the built-in stress tool or YCSB. The CLI has to do much more string conversion than a normal client would and it is not built for performance. You will definitely get better numbers through other means. On Thu, Jan 17, 2013 at 2:10 PM, Pradeep Kumar Mantha pradeep...@gmail.com wrote: Hi, I am trying to maximize the number of read queries executed per second. Here is my cluster configuration. Replication - Default. 12 Data Nodes. 16 Client Nodes - used for querying. Each client node executes 32 threads - each thread executes 76896 read queries using the cassandra-cli tool, i.e. all the read queries are stored in a file and that file is given to the cassandra-cli tool (using the -f option) which is executed by a thread. So the total number of queries for 16 client nodes is 16 * 32 * 76896. The read queries on each client node are submitted at the same time. The time taken for 16 * 32 * 76896 read queries is nearly 742 seconds - which is nearly 53k transactions/second. I would like to know if there is any other way/tool through which I can improve the number of transactions/second. Is the performance affected by the cassandra-cli tool? thanks pradeep
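A quick back-of-the-envelope check of the throughput figure quoted in the post:

```python
# Reproduce the "nearly 53k transactions/second" figure from the post.
clients = 16
threads_per_client = 32
queries_per_thread = 76896

total_queries = clients * threads_per_client * queries_per_thread
elapsed_seconds = 742

throughput = total_queries / elapsed_seconds
print(total_queries, int(throughput))   # ~39.4M reads at roughly 53,000/s
```

So the 53k/s number checks out arithmetically; the point of the reply is that the bottleneck is the CLI's string handling, not the cluster.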
Re: Key-hash based node selection
You can not be /mostly/ consistent read, like you can not be half-pregnant or half-transactional. You either are or you are not. If you do not have enough nodes for a QUORUM, the read fails. Thus you never get stale reads, you only get failed reads. The dynamic snitch makes reads sticky at READ.ONE. Until a node crosses the badness_threshold, reads should be routed to the same node (first natural endpoint). This is not a guarantee, as each node keeps its own snitch scores and routes requests based on its view of the scores. So at READ.ONE you could argue that Cassandra is mostly consistent based on your definition. On Fri, Jan 18, 2013 at 7:23 PM, Timothy Denike ti...@circuitboy.org wrote: /mostly/ consistent reads
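The arithmetic behind "not enough nodes for a QUORUM means the read fails" is just a majority calculation, sketched here for illustration:

```python
# QUORUM requires a strict majority of the replicas for a key.
def quorum(replication_factor: int) -> int:
    return replication_factor // 2 + 1

# With RF=3 a QUORUM read needs 2 live replicas; if 2 of the 3 are down
# the read fails outright instead of returning possibly stale data.
assert quorum(3) == 2
assert quorum(5) == 3
assert quorum(1) == 1
```

This is why QUORUM gives failed reads rather than stale reads: any two quorums of the same replica set must overlap in at least one node.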
Re: Is this how to read the output of nodetool cfhistograms?
This was described in good detail here: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ On Tue, Jan 22, 2013 at 9:41 AM, Brian Tarbox tar...@cabotresearch.com wrote: Thank you! Since this is a very non-standard way to display data it might be worth a better explanation in the various online documentation sets. Thank you again. Brian On Tue, Jan 22, 2013 at 9:19 AM, Mina Naguib mina.nag...@adgear.com wrote: On 2013-01-22, at 8:59 AM, Brian Tarbox tar...@cabotresearch.com wrote: The output of this command seems to make no sense unless I think of it as 5 completely separate histograms that just happen to be displayed together. Using this example output, should I read it as: my reads all took either 1 or 2 sstables. And separately, I had write latencies of 3, 7, 19. And separately I had read latencies of 2, 8, 69, etc.? In other words, each row isn't really a row, i.e. on those 16033 reads from a single SSTable I didn't have 0 write latency, 0 read latency, 0 row size and 0 column count. Is that right? Correct. A number in any of the metric columns is a count value bucketed in the offset on that row. There are no relationships between other columns on the same row. So your first row says 16033 reads were satisfied by 1 sstable. The other metrics (for example, the latency of these reads) are reflected in the histogram under Read Latency, under various other bucketed offsets.

Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
1          16033              0             0         0             0
2            303              0             0         0             1
3              0              0             0         0             0
4              0              0             0         0             0
5              0              0             0         0             0
6              0              0             0         0             0
7              0              0             0         0             0
8              0              0             2         0             0
10             0              0             0         0          6261
12             0              0             2         0           117
14             0              0             8         0             0
17             0              3            69         0           255
20             0              7           163         0             0
24             0             19          1369         0             0
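The "five independent histograms sharing one Offset column" reading can be made concrete with a small sketch, modeling two of the columns from the example as separate dicts keyed by offset:

```python
# Each metric column of cfhistograms is its own histogram keyed by the
# shared Offset column; numbers on the same row are unrelated.
sstables_per_read = {1: 16033, 2: 303}   # 16033 reads hit 1 sstable, 303 hit 2
read_latency = {8: 2, 12: 2, 14: 8, 17: 69, 20: 163, 24: 1369}

# "My reads all took either 1 or 2 sstables":
assert set(sstables_per_read) == {1, 2}

# The latencies of those same reads live in a *different* histogram,
# bucketed under different offsets, e.g. 1369 reads landed in the
# latency bucket at offset 24.
assert read_latency[24] == 1369
```

The mistake the output invites is reading across a row; the correct mental model is reading down each column independently, as above.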
Re: Large commit log reasons
By default Cassandra uses 1/3rd of the heap size for memtable storage. If you make the memtables smaller they should flush faster and your commit logs should not grow large. Large commit logs are not a problem in themselves; some use cases that write to some Column Families more than others can make the commit log directory grow. Basically a commit log segment does not get removed until everything in it is flushed. We have a nagios alarm on ours; if it hits 8GB something is wrong, but again a large commit log is normal and I would not worry. Edward On Wed, Jan 23, 2013 at 10:42 AM, vhmolinar vhmoli...@gmail.com wrote: Hi fellows. I currently have a 3 node cluster running with a replication factor of 1. It's a pretty simple deployment and all my enforcements are focused on writes rather than reads. Actually I'm noticing that my commit log size is always very big compared to the amount of data being persisted (which varies around 5gb). So, that leads me to three doubts: 1- When a commit log gets bigger, does it mean that cassandra hasn't processed those writes yet? 2- How could I speed up my flushes to sstables? 3- Does my commit log decrease as much as my sstable increases? Is it a rule? -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Large-commit-log-reasons-tp7584964.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
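The two knobs Edward alludes to lived in cassandra.yaml in the 1.x versions current at the time of this thread; the values below are illustrative, not recommendations, and names/defaults should be checked against your version's shipped yaml:

```yaml
# cassandra.yaml (Cassandra 1.x-era setting names)

# Cap total memtable space; smaller memtables flush sooner, so commit
# log segments become reclaimable sooner. Default is 1/3 of the heap.
memtable_total_space_in_mb: 1024

# Cap the total commit log size; when the cap is hit, Cassandra flushes
# the oldest dirty column families so old segments can be recycled.
commitlog_total_space_in_mb: 4096
```

As the reply notes, shrinking memtables to tame the commit log is usually the wrong trade: you pay for it with more frequent flushes and more compaction.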
Re: Large commit log reasons
1. The commit log is only read on startup: if writes are unflushed then the commit logs need to be replayed. 2. Shrink the memtable settings - but you don't want to do this. 3. Commit log size is not directly related to sstable size. E.g. if you write the same row a billion times the commit log will be large but the sstable will be 1 row. On Wed, Jan 23, 2013 at 11:10 AM, vhmolinar vhmoli...@gmail.com wrote:
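Point 3 can be illustrated with a toy model of the write path, assuming nothing beyond "commit log appends every write, memtable keeps the last value per cell":

```python
# Overwriting one row many times appends one commit log entry per write,
# but the memtable (and hence the flushed sstable) keeps only the final
# value for that cell.
writes = [("rowA", "col1", "v%d" % i) for i in range(1000)]

commitlog_entries = len(writes)      # every write is appended to the log
memtable = {}
for row, col, value in writes:
    memtable[(row, col)] = value     # last write wins, in place

assert commitlog_entries == 1000     # log grew with every write
assert len(memtable) == 1            # only one cell survives to the sstable
```

So a commit log much larger than the data on disk usually just means heavy overwriting, not a backlog of unprocessed writes.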
Re: Issue when deleting Cassandra rowKeys.
Make sure the timestamp on your delete is not older than the timestamp of the data; a delete with an older timestamp will not remove anything. On Sat, Jan 26, 2013 at 1:33 PM, Kasun Weranga kas...@wso2.com wrote: Hi all, When I delete some rowkeys programmatically I can see two rowkeys remain in the column family. I think it is due to tombstones. Is there a way to remove them when deleting rowkeys? Can I run compaction programmatically after deletion? Will it remove all these remaining rowkeys? Thanks, Kasun.
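The timestamp rule behind Edward's one-liner can be sketched as a tiny reconciliation function. This is a simplification for illustration: the higher timestamp wins, and Cassandra resolves a tie in favor of the delete.

```python
# Simplified cell-vs-tombstone reconciliation: data survives a delete
# only if it carries a strictly newer timestamp than the tombstone.
def cell_survives(data_ts: int, delete_ts: int) -> bool:
    return data_ts > delete_ts

assert not cell_survives(100, 100)   # delete at the data's timestamp removes it
assert not cell_survives(100, 150)   # newer delete removes older data
assert cell_survives(200, 100)       # data re-written later shadows the tombstone
```

So if your client generates delete timestamps from a clock that lags the one used for the writes, the deletes silently do nothing, which looks exactly like "rowkeys remain after delete".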
Re: Denormalization
One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially, re-write everything on all changes. On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck fredrik.l.stigb...@sitevision.se wrote: Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g. if we have a USER cf with name, email etc. and denormalize user data into many other CFs and then update the information about a user (name, email...). What is the best way to handle updating those user data properties which might be spread out over many CFs and many rows? Regards /Fredrik
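The "one event, N mutations" technique can be sketched as follows. All table and column names here are invented for illustration, not from any real schema in the thread:

```python
# Hypothetical fan-out: one "user updated" event becomes N mutations,
# one for the canonical row plus one per place the data is denormalized.
def mutations_for_user_update(user_id, name, email, group_ids):
    muts = [("users", user_id, {"name": name, "email": email})]
    for gid in group_ids:
        # re-write the denormalized copy held in each group's row
        muts.append(("group_members", gid, {user_id + ":name": name,
                                            user_id + ":email": email}))
    return muts

batch = mutations_for_user_update("u42", "Ada", "ada@example.com", ["g1", "g2"])
assert len(batch) == 3   # one canonical write plus two denormalized copies
```

The fan-out factor is what Dean's follow-up warns about: with 2-10 copies per event this is cheap, but with 30,000 copies the "writes are cheap" rule of thumb breaks down.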
Re: Denormalization
When I said that writes were cheap, I meant that in a normal case people are making 2-10 inserts for what in a relational database might be one. 30K inserts is certainly not cheap. Your use case with 30,000 inserts is probably a special case. Most directory services that I am aware of (OpenLDAP, Active Directory, Sun Directory Server) do eventually consistent master/slave and multi-master replication, so no worries about having to background something. You just want the replication to be fast enough so that when you call the employee about to be fired into the office, by the time he leaves and gets home he can not VPN in and rm -rf / your main file server :) On Sun, Jan 27, 2013 at 7:57 PM, Hiller, Dean dean.hil...@nrel.gov wrote: Sometimes this is true, sometimes not. We have a use case with an admin tool where we choose to do this denorm for ACL permission checks to make them extremely fast. That said, we have one issue with one object that has too many children (30,000), so when someone gives a user access to this one object with 30,000 children, we end up with a bad 60 second wait and users ended up getting frustrated and trying to cancel (our plan, since admin activity hardly ever happens, is to do it on a background thread and return immediately to the user, telling him his changes will take effect in 1 minute). After all, admin changes are infrequent anyway. This example demonstrates how sometimes it can almost burn you. I guess my real point is it really depends on your use cases ;). In a lot of cases denorm can work, but in some cases it burns you, so you have to balance it all. In 90% of our cases our denorm is working great, and for this one case we need to background the permission change as we still LOVE the performance of our ACL checks. Ps. 30,000 writes in cassandra is not cheap when done from one server ;) but in general parallelized writes are very fast for something like 500.
Later, Dean From: Edward Capriolo edlinuxg...@gmail.com Reply-To: user@cassandra.apache.org Date: Sunday, January 27, 2013 5:50 PM To: user@cassandra.apache.org Subject: Re: Denormalization One technique is that on the client side you build a tool that takes the event and produces N mutations. In C* writes are cheap, so essentially, re-write everything on all changes. On Sun, Jan 27, 2013 at 4:03 PM, Fredrik Stigbäck fredrik.l.stigb...@sitevision.se wrote: Hi. Since denormalized data is a first-class citizen in Cassandra, how do we handle updating denormalized data? E.g. if we have a USER cf with name, email etc. and denormalize user data into many other CFs and then update the information about a user (name, email...). What is the best way to handle updating those user data properties which might be spread out over many CFs and many rows? Regards /Fredrik