Re: yet a couple more questions on composite columns
Yiming, I am using 2 CFs. Performance-wise this should not be an issue. I use it as a data store for small files. My 2 CFs are: FilesMeta and FilesData.

2012/2/5 Yiming Sun yiming@gmail.com:
Interesting idea, Jim. Is there a reason you don't use metadata:{accountId} instead? For performance reasons?

On Sat, Feb 4, 2012 at 6:24 PM, Jim Ancona j...@anconafamily.com wrote:
I've used special values that still comply with the composite schema for the metadata columns, e.g. a column of 1970-01-01:{accountId} for a metadata column where the composite is DateType:UTF8Type.

Jim

On Sat, Feb 4, 2012 at 2:13 PM, Yiming Sun yiming@gmail.com wrote:
Thanks Andrey and Chris. It sounds like we don't necessarily have to use composite columns. From what I understand about dynamic CFs, each row may have completely different data from other rows; in our case the data in each row is similar to that in other rows, and my concern was more about the homogeneity of the data between columns.

In our original supercolumn-based schema, one special supercolumn called "metadata" contains a number of subcolumns holding metadata that describes each collection (e.g. the number of documents). The rest of the supercolumns in the same row are all IDs of documents belonging to the collection, and for each document supercolumn the subcolumns contain the document content as well as metadata on the individual document (e.g. its checksum).

To move away from the supercolumn schema, I could either create two CFs, one to hold metadata and the other to hold document content, or create just one CF that mixes metadata and document content in the same row, using composite column names to identify whether a particular column is metadata or a document. I am just wondering if you have any input on the pros and cons of each schema. -- Y.
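Jim's sentinel trick can be sketched in plain Python (no Cassandra client involved; all names here are illustrative, not from the thread). Composite column names compare component by component, like Python tuples, so a metadata column whose first component is the epoch date still satisfies a DateType:UTF8Type comparator while sorting ahead of every real data column in the row:

```python
# Sketch of why a 1970-01-01:{accountId} column name works as metadata:
# composites sort component by component, so the epoch-dated column always
# comes first in the row. Column names below are hypothetical examples.
from datetime import date

SENTINEL_DATE = date(1970, 1, 1)  # epoch date reserved for metadata columns

def data_column(event_date, doc_id):
    # a normal DateType:UTF8Type composite column name
    return (event_date, doc_id)

def metadata_column(account_id):
    # the epoch first component keeps the name valid for the comparator
    # and sorts it ahead of every real data column in the row
    return (SENTINEL_DATE, account_id)

row = sorted([
    data_column(date(2012, 2, 4), "doc-17"),
    data_column(date(2012, 1, 15), "doc-3"),
    metadata_column("acct-42"),
])
print(row[0])  # the metadata column sorts first
```

A client can then read the row's metadata with a single column slice anchored at the sentinel date, before any document columns.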
On Fri, Feb 3, 2012 at 10:27 PM, Chris Gerken chrisger...@mindspring.com wrote:

On 4 February 2012 06:21, Yiming Sun yiming@gmail.com wrote:
> I cannot have one composite column name with 3 components while another with 4 components?

Just put in 4 components and leave the last one empty (if it is the same type).

> Another question I have is how flexible composite columns actually are. If my data model has a CF containing US zip codes with the following composite columns:
>   {OH:Spring Field} : 45503
>   {OH:Columbus}     : 43085
>   {FL:Spring Field} : 32401
>   {FL:Key West}     : 33040
> I know I can ask Cassandra to give me the zip codes of all cities in OH. But can I ask it to give me the zip codes of all cities named Spring Field using this model? Thanks.

No, you have to constrain the first composite component first.

I'd use a dynamic CF:
  row key      = state abbreviation
  column name  = city name
  column value = zip code (or a complex object, one of whose properties is the zip code)

You can iterate over the columns in a single row to get a state's city names and their zip codes, and you can do a get_range_slices over all keys, with the column slice starting and ending on the city name, to find the zip codes for all cities with the given name. I think.

- Chris
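Chris's dynamic-CF layout, and the asymmetry between the two queries, can be mimicked with a plain Python dict of dicts (a sketch only, no Cassandra API): "all cities in OH" is one row read, while "all zips for a given city name" needs a range slice over every row with a column slice that starts and ends on the city name.

```python
# Dynamic CF sketch: row key = state, column name = city, value = zip code.
cf = {
    "OH": {"Spring Field": 45503, "Columbus": 43085},
    "FL": {"Spring Field": 32401, "Key West": 33040},
}

def cities_in_state(state):
    """One row read: iterate the columns of a single row."""
    return dict(cf.get(state, {}))

def zips_for_city(city):
    """Like get_range_slices over all keys with the column slice
    [city, city]: every row must be visited."""
    return {state: cols[city] for state, cols in cf.items() if city in cols}

print(cities_in_state("OH"))
print(zips_for_city("Spring Field"))
```

Note that with a random partitioner the range scan in `zips_for_city` touches the whole cluster, which is why the "query by second component" direction is so much more expensive than the row read.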
Re: yet a couple more questions on composite columns
Thanks R.V.! We are also dealing with many small files, so this sounds really promising. -- Y.

On Sun, Feb 5, 2012 at 9:59 AM, R. Verlangen ro...@us2.nl wrote:
> Yiming, I am using 2 CFs. Performance-wise this should not be an issue. I use it as a data store for small files. My 2 CFs are: FilesMeta and FilesData.
Re: Restart cassandra every X days?
Close enough for me :)

A
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/02/2012, at 8:39 PM, R. Verlangen wrote:
Well, it seems it's balancing itself. 24 hours later the ring looks like this:

  ***.89   datacenter1  rack1  Up  Normal  7.36 GB  50.00%  0
  ***.135  datacenter1  rack1  Up  Normal  8.84 GB  50.00%  85070591730234615865843651857942052864

Looks pretty normal, right?

2012/2/2 aaron morton aa...@thelastpickle.com:
Speaking technically, that ain't right. I would:
* Check if node .135 is holding a lot of hints.
* Take a look on disk and see what is there.
* Go through a repair and compact on each node.

Cheers

On 2/02/2012, at 9:55 PM, R. Verlangen wrote:
Yes, I already did a repair and cleanup. Currently my ring looks like this:

  Address  DC           Rack   Status  State   Load     Owns    Token
  ***.89   datacenter1  rack1  Up      Normal  2.44 GB  50.00%  0
  ***.135  datacenter1  rack1  Up      Normal  6.99 GB  50.00%  85070591730234615865843651857942052864

It's not really a problem, but I'm still wondering why this happens.

2012/2/1 aaron morton aa...@thelastpickle.com:
Do you mean the load in nodetool ring is not even, despite the tokens being evenly distributed? I would assume this is not the case given the difference, but it may be hints, given you have just done an upgrade. Check the system keyspace using nodetool cfstats to see. Hints will eventually be delivered and deleted. More likely you will want to:
1) nodetool repair, to make sure all data is distributed, then
2) nodetool cleanup, if you have changed the tokens at any point.

Cheers

On 31/01/2012, at 11:56 PM, R. Verlangen wrote:
After running for 3 days on Cassandra 1.0.7 it seems the problem has been solved. One weird thing remains: of our 2 nodes (each owning 50% of the ring), the first's usage is just over 25% of the second's. Anyone got an explanation for that?
2012/1/29 aaron morton aa...@thelastpickle.com:
Yes, but... for every upgrade, read NEWS.txt; it goes through the upgrade procedure in detail. If you want to feel extra smart, scan through CHANGES.txt to get an idea of what's going on.

Cheers

On 29/01/2012, at 4:14 AM, Maxim Potekhin wrote:
Sorry if this has been covered, I was concentrating solely on 0.8.x -- can I just download 1.0.x and continue using the same data on the same cluster? Maxim

On 1/28/2012 7:53 AM, R. Verlangen wrote:
Ok, seems that it's clear what I should do next ;-)

2012/1/28 aaron morton aa...@thelastpickle.com:
There are no blockers to upgrading to 1.0.x.

On 28/01/2012, at 7:48 AM, R. Verlangen wrote:
Ok. Seems that an upgrade might fix these problems. Is Cassandra 1.x.x stable enough to upgrade to, or should we wait a couple of weeks?

2012/1/27 Edward Capriolo edlinuxg...@gmail.com:
I would not say that issuing a restart every X days is a good idea. You are mostly developing a superstition. You should find the source of the problem. It could be JMX or Thrift clients not closing connections. We don't restart nodes on a schedule; they work fine.

On Thursday, January 26, 2012, Mike Panchenko m...@mihasya.com wrote:
There are two relevant bugs (that I know of), both resolved in somewhat recent versions, which make somewhat regular restarts beneficial:
https://issues.apache.org/jira/browse/CASSANDRA-2868 (memory leak in GCInspector, fixed in 0.7.9/0.8.5)
https://issues.apache.org/jira/browse/CASSANDRA-2252 (heap fragmentation due to the way memtables used to be allocated, refactored in 1.0.0)
Restarting daily is probably too frequent for either one of those problems. We usually notice degraded performance in our ancient cluster after ~2 weeks without a restart.
As Aaron mentioned, if you have plenty of disk space, there's no reason to worry about cruft sstables. The size of your active set is what matters, and you can determine whether that's getting too big by watching for iowait (due to reads from the data partition) and/or paging activity of the java process. When you hit that problem, the solution is to 1. try to tune your caches and 2. add more nodes to spread the load. I'll reiterate: looking at raw disk space usage should not be your guide for that. Forcing a GC generally works, but should not be relied upon.
Re: Concurrent compactors
Not sure I understand the question. Do you have an example where a CF is not getting compacted?

The compaction tasks will be processed in the order they are submitted. If you have concurrent_compactors > 1, then the thread pool for compactions (excluding validation compactions) will be able to process multiple compaction tasks in parallel. If you have a CF that gets a lot more traffic than the other CFs, it will require more compaction. But by running with concurrent_compactors > 1, smaller CFs should still be able to get through.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/02/2012, at 9:15 PM, Viktor Jevdokimov wrote:
My concern is not about cleanup, but about the supposed "tendency of small sstables to accumulate during a single long running compaction". When the next task is for the same column family as a currently long-running compaction, compactions for other column families are frozen and the concurrent_compactors > 1 setting just does not work.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063. Fax: +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, February 01, 2012 21:51
To: user@cassandra.apache.org
Subject: Re: Concurrent compactors

(Assuming a 1.0.x release.) From the comments in cassandra.yaml:

  # Number of simultaneous compactions to allow, NOT including
  # validation compactions for anti-entropy repair. Simultaneous
  # compactions can help preserve read performance in a mixed read/write
  # workload, by mitigating the tendency of small sstables to accumulate
  # during a single long running compaction. The default is usually
  # fine and if you experience problems with compaction running too
  # slowly or too fast, you should look at
  # compaction_throughput_mb_per_sec first.
  #
  # This setting has no effect on LeveledCompactionStrategy.
  #
  # concurrent_compactors defaults to the number of cores.
  # Uncomment to make compaction mono-threaded, the pre-0.8 default.
  #concurrent_compactors: 1

If you set it to 1 then only 1 compaction should run at a time, excluding validation. How often do you run a cleanup compaction? They are only necessary when you perform a token move.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 1/02/2012, at 9:48 PM, Viktor Jevdokimov wrote:
Hi,

When concurrent_compactors is set to more than 1, it's rare that more than 1 compaction actually runs in parallel. I didn't check the source code, but it looks like when the next compaction task (any of minor, major, or cleanup) is for the same CF as a running compaction, it will not start in parallel, and the tasks after it are not checked. Would it be possible to check all pending tasks, not only the next one, to find which of them can be started? This matters especially when the nightly cleanup is running: a lot of cleanup tasks are pending, and regular minor compactions wait until all cleanup compactions are finished.

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer
Email: viktor.jevdoki...@adform.com
Phone: +370 5 212 3063. Fax: +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
Re: Write latency of counter updates across multiple rows
I'm not thinking about counters specifically here, and assuming you are sending batch mutations of the same size...

The mutations (inserts, counter increments) for a row are turned into a single task server side, and are then processed in a serial fashion. If you send a mutation for 2 rows it will be turned into two tasks, which can then be processed in parallel.

There is a point of diminishing returns here. Each row you write to or read from becomes a task; if you write to 1,000 rows at once you will put 1,000 tasks into a thread pool that typically has 32 concurrent threads. This may block, or add latency to, other requests. It's more of an issue with reads than writes.

Does that apply to your situation?

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 4/02/2012, at 1:19 AM, Amit Chavan wrote:
Hi,

In our use case, we maintain minute-wise roll-ups for different metrics. These are stored in a counter column family where the row key is a composite containing the timestamp rounded to the last minute and an integer between 0 and 9 (this integer is calculated as the MD5 hash of the metric mod 10). The column names are the metrics we wish to track. Typically, each row has about 100,000 counters.

We tested two scenarios. The first is as described above; in that case we got a per-write latency of about 80 to 100 microseconds. In the other scenario, we calculated the integer in the row key as mod 100; in that case we observed a per-write latency of 50 to 70 microseconds. I wish to understand why updates to counters were faster when they were spread across multiple rows.

Cluster summary: 4 nodes running Cassandra 1.0.5, each with 8 cores, 32 GB RAM, and a 10 GB Cassandra heap. We are using a replication factor of 2.

-- Thanks! Amit Chavan
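Amit's sharded row key can be sketched as follows (function and variable names are mine, not from the thread). The point of Aaron's answer is visible in the row counts: the same batch of counter updates spread over mod 100 touches more rows than mod 10, so the server can turn it into more tasks that run in parallel, and each per-row serial task is smaller.

```python
# Sketch of the composite row key (minute bucket, md5(metric) mod shards)
# described in the thread, and how the shard count changes how many rows
# (and therefore how many parallel server-side tasks) one batch touches.
import hashlib

def shard_row_key(minute_ts, metric, shards):
    """Row key: (timestamp rounded to the minute, md5(metric) mod shards)."""
    h = int(hashlib.md5(metric.encode()).hexdigest(), 16)
    return (minute_ts, h % shards)

metrics = ["metric-%d" % i for i in range(1000)]
minute = 1328400000  # some minute-aligned epoch timestamp

rows_mod10 = {shard_row_key(minute, m, 10) for m in metrics}
rows_mod100 = {shard_row_key(minute, m, 100) for m in metrics}

# More shards => the same metrics spread over more, smaller rows.
print(len(rows_mod10), len(rows_mod100))
```

The trade-off is on the read side: reconstructing one minute's roll-up now requires slicing 100 rows instead of 10.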
Re: nodetool hangs and didn't print anything with firewall
Does it work with iptables disabled? You could add a LOG target to your firewall rules to see if the firewall is dropping the packets.

On Sun, Feb 5, 2012 at 5:35 PM, Roshan codeva...@gmail.com wrote:
Hi,

I have a 2-node Cassandra cluster, and each Linux box is configured with a firewall. The ports 7000, 7199 and 9160 are open in the firewall and I can telnet to those ports from both ends without any issue. But if I try to run nodetool from one of the Cassandra nodes, it hangs and doesn't print anything.

$ sh nodetool -h app1 info

Could someone please help me with this? Thanks.
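The logging suggested in the reply can be done with a firewall configuration fragment along these lines (a sketch only: the chain name INPUT and rule placement are assumptions about the poster's setup, and the rules must come before any DROP rule to see the traffic):

```shell
# Log traffic to the ports mentioned in this thread (7000 gossip, 7199 JMX,
# 9160 Thrift) so dropped packets show up in the kernel log (dmesg/syslog).
iptables -I INPUT -p tcp -m multiport --dports 7000,7199,9160 \
         -j LOG --log-prefix "cassandra: "
```

One known cause worth checking for this exact symptom: after the initial connection to the JMX port 7199, the JMX/RMI protocol makes the client reconnect to a second, randomly chosen ephemeral port, so nodetool can hang behind a firewall even when 7199 itself is open.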
Best way to know the cluster status
Hi,

What is the best way to know the cluster status via PHP? Currently we are trying to connect to each individual Cassandra instance with a specified timeout, and if the connection fails we report the node as down. But this test remains faulty. What are the other ways to test the availability of nodes in a Cassandra cluster? How does DataStax OpsCenter manage to do that?

Regards,
Tamil Selvan
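The probe described in the question (a bare TCP connect with a timeout) looks like this sketched in Python; the same idea works from PHP via fsockopen()'s timeout argument. Note the caveat implied by "remains faulty": a successful connect only proves the port answers, not that the node is healthy, and a timeout from one client machine may just be a network problem on the client's side. A more robust approach is to ask any one live node for the cluster's own view of membership (e.g. the Thrift describe_ring call, or nodetool ring), so the status comes from gossip rather than from probing each host yourself.

```python
# Liveness probe sketch: TCP connect with timeout to the Thrift port (9160).
import socket

def node_is_up(host, port=9160, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(node_is_up("127.0.0.1", port=9160, timeout=0.5))
```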