row cache -- does it have data from other nodes?
Hello, when I choose to have a row cache -- will it contain data that is owned by other nodes? Thanks, Maxim
Re: Server Side Logic/Script - Triggers / StoreProc
About a year ago I started getting a strange feeling that the NoSQL community is busy re-creating the RDBMS in minute detail. Why did we bother in the first place? Maxim

On 4/27/2012 6:49 PM, Data Craftsman wrote: Howdy, some polyglot-persistence (NoSQL) products have started to support server-side scripting, similar to RDBMS stored procedures, e.g. Redis Lua scripting. I hope it is Python when Cassandra gets the server-side scripting feature. FYI: http://antirez.com/post/250 and http://nosql.mypopescu.com/post/19949274021/alchemydb-an-integrated-graphdb-rdbms-kv-store "Server side scripting support is an extremely powerful tool. Having processing close to data (i.e. data locality) is a well known advantage, ..., it can open the doors to completely new features." Thanks, Charlie (@mujiang) 一个 木匠 === Data Architect Developer http://mujiang.blogspot.com

On Sun, Apr 22, 2012 at 9:35 AM, Brian O'Neill boneil...@gmail.com wrote: Praveen, we are certainly interested. To get things moving we implemented an add-on for Cassandra to demonstrate the viability (using AOP): https://github.com/hmsonline/cassandra-triggers Right now the implementation executes triggers asynchronously, allowing you to implement a Java interface and plug in your own Java class that will get called for every insert. Per the discussion on 1311, we intend to extend our proof of concept to be able to invoke scripts as well. (Minimally we'll enable JavaScript, but we'll probably allow for Ruby and Groovy as well.) -brian

On Apr 22, 2012, at 12:23 PM, Praveen Baratam wrote: I found that triggers are coming in Cassandra 1.2 (https://issues.apache.org/jira/browse/CASSANDRA-1311), but there is no mention of any stored-procedure-like pattern. I know this has been discussed many times but never met with any initiative. Even Groovy was staged out of the trunk. Cassandra is great for logging, and as such will be infinitely more useful if some logic can be pushed into the Cassandra cluster, nearer to the location of the data, to generate a materialized view useful for applications. Server-side scripts/routines in distributed databases could soon prove to be the differentiating factor. Let me reiterate things with a use case. In our application we store time-series data in wide rows, with a TTL set on each point to prevent the data from growing beyond acceptable limits. Still, the data size can be a limiting factor in moving all of it from the cluster node to the querying node, and then to the application via Thrift, for processing and presentation. Ideally we should process the data on the node where it resides and pass only the materialized view of the data upstream. This would be trivial if Cassandra implemented some sort of server-side scripting, and CQL semantics to call it. Is anybody else interested in a similar feature? Is it being worked on? Are there any alternative strategies for this problem? Praveen

-- Brian O'Neill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Cassandra search performance
Jason, I'm using plenty of secondary indexes with no problem at all. Looking at your example, as I think you understand, you forgo the indexes by combining two conditions in one query, thinking along the lines of what is often done in an RDBMS. A scan is expected in this case, and there is no magic to avoid it. However, if this query is important, you can easily index on the two conditions together, using a composite type (look it up) or, for a quick and easy solution, string concatenation. That is, you _create an additional column_ which contains a combination of the two you want to use in the query, then index on it (see the sketch after this message). Problem solved. The composite solution is more elegant, but what I describe works in simple cases. It works for me. Maxim

On 4/25/2012 10:45 AM, Jason Tang wrote: 1.0.8

On 25 April 2012 at 10:38 PM, Philip Shon philip.s...@gmail.com wrote: What version of Cassandra are you using? I found a big performance hit when querying on a secondary index. I came across this bug in versions prior to 1.1: https://issues.apache.org/jira/browse/CASSANDRA-3545 Hope that helps.

2012/4/25 Jason Tang ares.t...@gmail.com: And I found that if I only have the search condition "status", it only scans 200 records. But if I combine it with another condition, "partition", it scans all records, because the "partition" condition matches all records. Combined with some other condition such as "userName", however, even though all 1,000,000 records have the same userName, it only scans 200 records. So it is impacted by the scan execution plan. If we have several search conditions, how does it work? Is there a similar execution plan in Cassandra?

On 25 April 2012 at 9:18 PM, Jason Tang wrote: Hi, we have the following CF, and use a secondary index to search for a simple status value; among 1,000,000 row records we have 200 records with the status we want. But when we start to search, the performance is very poor. Checking with the command ./bin/nodetool -h localhost -p 8199 cfstats, Cassandra read 1,000,000 records at a read latency of 0.2 ms, so in total it took 200 seconds. It uses lots of CPU, and checking the stack, all threads in Cassandra are reading from sockets. So I wonder how to really use the index to find the 200 records instead of scanning all rows. (Super column?)

ColumnFamily: queue
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds / keys to save: 0.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 0.0/0
GC grace seconds: 0
Compaction min/max thresholds: 4/32
Read repair chance: 0.0
Replicate on write: false
Bloom Filter FP chance: default
Built indexes: [queue.idxStatus]
Column Metadata:
Column Name: status (737461747573)
Validation Class: org.apache.cassandra.db.marshal.AsciiType
Index Name: idxStatus
Index Type: KEYS

BRs, Jason
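For illustration, here is a minimal sketch of the concatenation trick in Python with the pycassa client. The CF and the 'status'/'partition' columns come from the thread; the combined column 'status_partition' and its KEYS index are assumptions you would add to the schema yourself:

    import pycassa
    from pycassa.index import create_index_expression, create_index_clause

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    queue = pycassa.ColumnFamily(pool, 'queue')

    # On write, store the combined value alongside the originals.
    # 'status_partition' is a hypothetical column with a KEYS index on it.
    status, partition = 'Ready', '17'
    queue.insert('row-key-1', {
        'status': status,
        'partition': partition,
        'status_partition': status + ':' + partition,
    })

    # On read, a single EQ expression against the combined, indexed column
    # touches only the matching rows instead of scanning the whole CF.
    expr = create_index_expression('status_partition', 'Ready:17')
    clause = create_index_clause([expr], count=200)
    for key, columns in queue.get_indexed_slices(clause):
        print(key, columns)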
Re: RMI/JMX errors, weird
Hello Aaron, it's probably the over-optimistic number of concurrent compactors that was tripping the system. I do not entirely understand the correlation here; maybe the compactors were overloading the neighboring nodes, causing timeouts. I tuned the concurrency down, and after a while things seem to have settled down. Thanks for the suggestion. Maxim

On 4/19/2012 4:13 PM, aaron morton wrote: "1150 pending tasks, and is not making progress." Not all pending tasks reported by nodetool compactionstats actually run; once they get a chance to run, the files they were going to work on may have already been compacted. Given that repair tests at double the phi threshold, it may not make much difference. Did other nodes notice it was dead? Was there anything in the log that showed it was under duress (GC or dropped-message logs)? Is the compaction a consequence of the repair? (The streaming stage can result in compactions.) Or do you think the node is just behind on compactions? If you feel compaction is hurting the node, consider setting concurrent_compactors in the yaml to 2. You can also isolate the node from updates using nodetool disablegossip and disablethrift, and then turn off the IO limiter with nodetool setcompactionthroughput 0. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 20/04/2012, at 12:29 AM, Maxim Potekhin wrote: Hello Aaron, how should I go about fixing that? Also, after a repeated attempt to compact, it goes again into building the secondary index, with 1150 pending tasks, and is not making progress. I suspected a disk system failure, but this needs to be confirmed. So basically, do I need to tune the phi threshold up? Thing is, there was no heavy load on the cluster at all. Thanks, Maxim

On 4/19/2012 7:06 AM, aaron morton wrote: At some point the gossip system on the node this log is from decided that 130.199.185.195 was DOWN. This was based on how often the node was gossiping to the cluster. The active repair session was informed, and to avoid failing the job unnecessarily it tested whether the errant node's phi value was twice the configured phi_convict_threshold. It was, and the repair was killed. Take a look at the logs on 130.199.185.195 and see if anything was happening on the node at the same time. It could be GC or an overloaded node (it would log about dropped messages). Perhaps other nodes also saw 130.199.185.195 as down? It only needed to be down for a few seconds. Hope that helps. - Aaron Morton
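For reference, the tuning aaron describes amounts to roughly the following (a sketch; option spellings as in the 0.8-era cassandra.yaml and nodetool):

    # cassandra.yaml -- limit how many compactions run in parallel
    concurrent_compactors: 2

    # isolate the node from clients and gossip, then uncap compaction I/O
    nodetool -h <host> disablegossip
    nodetool -h <host> disablethrift
    nodetool -h <host> setcompactionthroughput 0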
Re: RMI/JMX errors, weird
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
... 3 more
Caused by: java.io.IOException: Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died
at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:723)
at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:760)
at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:165)
at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538)
at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

On 4/12/2012 10:03 PM, aaron morton wrote: Look at the server-side logs for errors. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 13/04/2012, at 11:47 AM, Maxim Potekhin wrote: Hello, I'm doing compactions under 0.8.8. Recently I started seeing a stack trace like the one below, and I can't figure out what causes this to appear. The cluster has been in operation for more than half a year w/o errors like this one. Any help will be appreciated. Thanks, Maxim

WARNING: Failed to check the connection: java.net.SocketTimeoutException: Read timed out
Exception in thread "main" java.io.IOException: Repair command #1: some repair session(s) failed (see log for details).
at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
Re: Is the secondary index re-built under compaction?
Thanks Aaron. Just to be clear: every time I do a compaction, I rebuild all indexes from scratch. Right? Maxim

On 4/17/2012 6:16 AM, aaron morton wrote: Yes, secondary index builds are done via the compaction manager. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 17/04/2012, at 1:06 PM, Maxim Potekhin wrote: I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim
Re: Is the secondary index re-built under compaction?
Thanks Jake. Then I am definitely seeing weirdness, as there are tons of pending tasks in compaction stats, and tons of index files created in the data directory. Plus it does tell me that it is building the secondary index, and that seems to be happening at a glacial pace. I have 2 CFs there, with multiple secondary indexes. I'll try to compact the CFs one by one, reboot, and see if that helps. Maxim

On 4/17/2012 9:53 AM, Jake Luciani wrote: No, the indexes are not rebuilt on every compaction. Only if you manually rebuild, or bootstrap a new node, does it use the compaction manager to rebuild.

On Tue, Apr 17, 2012 at 9:47 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Aaron. Just to be clear: every time I do a compaction, I rebuild all indexes from scratch. Right? Maxim

On 4/17/2012 6:16 AM, aaron morton wrote: Yes, secondary index builds are done via the compaction manager. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 17/04/2012, at 1:06 PM, Maxim Potekhin wrote: I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim

-- http://twitter.com/tjake
Re: Is the secondary index re-built under compaction?
I understand that indexes are CFs. But the compaction stats say it's building the index, not compacting the corresponding CF. Either that's an ambiguous diagnostic, or indeed something is not right with my rig as of late. Maxim

On 4/17/2012 10:05 AM, Jake Luciani wrote: Well, since the secondary indexes are themselves column families, they too are compacted along with everything else.

On Tue, Apr 17, 2012 at 10:02 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Jake. Then I am definitely seeing weirdness, as there are tons of pending tasks in compaction stats, and tons of index files created in the data directory. Plus it does tell me that it is building the secondary index, and that seems to be happening at a glacial pace. I have 2 CFs there, with multiple secondary indexes. I'll try to compact the CFs one by one, reboot, and see if that helps. Maxim

On 4/17/2012 9:53 AM, Jake Luciani wrote: No, the indexes are not rebuilt on every compaction. Only if you manually rebuild, or bootstrap a new node, does it use the compaction manager to rebuild.

On Tue, Apr 17, 2012 at 9:47 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Aaron. Just to be clear: every time I do a compaction, I rebuild all indexes from scratch. Right? Maxim

On 4/17/2012 6:16 AM, aaron morton wrote: Yes, secondary index builds are done via the compaction manager. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 17/04/2012, at 1:06 PM, Maxim Potekhin wrote: I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim

-- http://twitter.com/tjake
Re: Is the secondary index re-built under compaction?
Yes. Sorry I didn't mention this, but of course I check on the indexes once in a while. So yes, they are marked as built. All of this started happening after a few days of a continuous loading process. Since the nodes have good hardware (24 cores + SSD), the apparent load on each node was nothing remarkable, even at a 20k/s insertion rate. But maybe I'm being overoptimistic. Maxim

On 4/17/2012 10:12 AM, Jake Luciani wrote: Hmm, that does sound fishy. When you run show keyspaces from cassandra-cli, it shows which indexes are built. Are they marked built in your column family? -Jake

On Tue, Apr 17, 2012 at 10:09 AM, Maxim Potekhin potek...@bnl.gov wrote: I understand that indexes are CFs. But the compaction stats say it's building the index, not compacting the corresponding CF. Either that's an ambiguous diagnostic, or indeed something is not right with my rig as of late. Maxim

On 4/17/2012 10:05 AM, Jake Luciani wrote: Well, since the secondary indexes are themselves column families, they too are compacted along with everything else.

On Tue, Apr 17, 2012 at 10:02 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Jake. Then I am definitely seeing weirdness, as there are tons of pending tasks in compaction stats, and tons of index files created in the data directory. Plus it does tell me that it is building the secondary index, and that seems to be happening at a glacial pace. I have 2 CFs there, with multiple secondary indexes. I'll try to compact the CFs one by one, reboot, and see if that helps. Maxim

On 4/17/2012 9:53 AM, Jake Luciani wrote: No, the indexes are not rebuilt on every compaction. Only if you manually rebuild, or bootstrap a new node, does it use the compaction manager to rebuild.

On Tue, Apr 17, 2012 at 9:47 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Aaron. Just to be clear: every time I do a compaction, I rebuild all indexes from scratch. Right? Maxim

On 4/17/2012 6:16 AM, aaron morton wrote: Yes, secondary index builds are done via the compaction manager. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 17/04/2012, at 1:06 PM, Maxim Potekhin wrote: I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim

-- http://twitter.com/tjake
Re: Is the secondary index re-built under compaction?
The offending CF only has one. The other one, which seems to behave well, has nine. Maxim

On 4/17/2012 10:20 AM, Jake Luciani wrote: How many indexes are there?

On Tue, Apr 17, 2012 at 10:16 AM, Maxim Potekhin potek...@bnl.gov wrote: Yes. Sorry I didn't mention this, but of course I check on the indexes once in a while. So yes, they are marked as built. All of this started happening after a few days of a continuous loading process. Since the nodes have good hardware (24 cores + SSD), the apparent load on each node was nothing remarkable, even at a 20k/s insertion rate. But maybe I'm being overoptimistic. Maxim

On 4/17/2012 10:12 AM, Jake Luciani wrote: Hmm, that does sound fishy. When you run show keyspaces from cassandra-cli, it shows which indexes are built. Are they marked built in your column family? -Jake

On Tue, Apr 17, 2012 at 10:09 AM, Maxim Potekhin potek...@bnl.gov wrote: I understand that indexes are CFs. But the compaction stats say it's building the index, not compacting the corresponding CF. Either that's an ambiguous diagnostic, or indeed something is not right with my rig as of late. Maxim

On 4/17/2012 10:05 AM, Jake Luciani wrote: Well, since the secondary indexes are themselves column families, they too are compacted along with everything else.

On Tue, Apr 17, 2012 at 10:02 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Jake. Then I am definitely seeing weirdness, as there are tons of pending tasks in compaction stats, and tons of index files created in the data directory. Plus it does tell me that it is building the secondary index, and that seems to be happening at a glacial pace. I have 2 CFs there, with multiple secondary indexes. I'll try to compact the CFs one by one, reboot, and see if that helps. Maxim

On 4/17/2012 9:53 AM, Jake Luciani wrote: No, the indexes are not rebuilt on every compaction. Only if you manually rebuild, or bootstrap a new node, does it use the compaction manager to rebuild.

On Tue, Apr 17, 2012 at 9:47 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Aaron. Just to be clear: every time I do a compaction, I rebuild all indexes from scratch. Right? Maxim

On 4/17/2012 6:16 AM, aaron morton wrote: Yes, secondary index builds are done via the compaction manager. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 17/04/2012, at 1:06 PM, Maxim Potekhin wrote: I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim

-- http://twitter.com/tjake
Is the secondary index re-built under compaction?
I noticed that nodetool compactionstats shows the secondary index being built when I initiate a compaction. Is this to be expected? Cassandra version 0.8.8. Thank you, Maxim
RMI/JMX errors, weird
Hello, I'm doing compactions under 0.8.8. Recently I started seeing a stack trace like the one below, and I can't figure out what causes this to appear. The cluster has been in operation for more than half a year w/o errors like this one. Any help will be appreciated. Thanks, Maxim

WARNING: Failed to check the connection: java.net.SocketTimeoutException: Read timed out
Exception in thread "main" java.io.IOException: Repair command #1: some repair session(s) failed (see log for details).
at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
at sun.rmi.transport.Transport$1.run(Transport.java:159)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
a very simple indexing question (strange thing seen in CLI)
Greetings, Cassandra 0.8.8 is used. I'm trying to create an additional CF which is trivial in all respects: just ascii columns and a few indexes. This is how I add an index:

update column family files with column_metadata = [{column_name : '1', validation_class : AsciiType, index_type : 0, index_name : 'pandaid'}];

When I do show keyspaces, I see this:

ColumnFamily: files
Key Validation Class: org.apache.cassandra.db.marshal.BytesType
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Columns sorted by: org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds: 0.0/0
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 20.0/14400
Memtable thresholds: 2.2828125/1440/487 (millions of ops/minutes/MB)
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Replicate on write: true
Built indexes: [files.pandaid]
Column Metadata:
Column Name: (01)
Validation Class: org.apache.cassandra.db.marshal.AsciiType
Index Name: pandaid
Index Type: KEYS

First off, why do I see (01)? I have a similar CF where I just see 1. Before inserting the data, I did an 'assume ... as ascii' in the CLI on the keys, comparator and validator. The index has been built. When I try to access the data via the index, I get this:

[default@PANDA] get files where '1'='1460103677';
InvalidRequestException(why:No indexed columns present in index clause with operator EQ)

What is happening? Sorry for the admittedly trivial question; obviously I'm stuck on something quite simple which I managed to do with zero effort in the past. Maxim
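For what it's worth, a client-side sketch (pycassa) that sidesteps the CLI's byte-encoding guesswork; it assumes the index really was built against the ascii column name '1':

    import pycassa
    from pycassa.index import create_index_expression, create_index_clause

    pool = pycassa.ConnectionPool('PANDA', ['localhost:9160'])
    files = pycassa.ColumnFamily(pool, 'files')

    # The column name is sent as the ascii byte '1' (0x31). If the CLI
    # actually stored the index under byte 0x01 -- which the (01) display
    # suggests -- this query will fail the same way, and the column
    # metadata needs to be redefined against the intended name.
    expr = create_index_expression('1', '1460103677')
    clause = create_index_clause([expr])
    for key, columns in files.get_indexed_slices(clause):
        print(key, columns)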
Re: import
Since Python has a native csv module, it's trivial to achieve. I load lots of csv data into my database daily. Maxim

On 3/27/2012 11:44 AM, R. Verlangen wrote: You can write your own script to parse the Excel file (export it as csv) and import it with batch inserts. Should be pretty easy if you have experience with those techniques.

2012/3/27 puneet loya puneetl...@gmail.com: I want to import files from Excel into Cassandra. Is it possible? Is there any tool that can help? What's the best way? Please reply :)

-- With kind regards, Robin Verlangen www.robinverlangen.nl
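A minimal sketch of that approach, using the csv module and the pycassa client; the keyspace, CF name and column layout are made up for illustration:

    import csv
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'imported')  # hypothetical CF

    # Export the Excel sheet as csv first; here the first column is the
    # row key and the header row supplies the column names.
    with open('export.csv') as f:
        reader = csv.reader(f)
        header = next(reader)
        batch = cf.batch(queue_size=100)  # flushes every 100 mutations
        for row in reader:
            batch.insert(row[0], dict(zip(header[1:], row[1:])))
        batch.send()  # flush whatever is left in the queue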
Building a brand new cluster and readying it for production -- advice needed
Dear All, after all the testing and continuous operation of my first cluster, I've been given an OK to build a second production Cassandra cluster, in Europe. There were posts in recent weeks regarding the most stable and solid Cassandra version, and I was wondering if anything better has appeared since it was last discussed. At this juncture I don't need features, just rock-solid stability. Are the 0.8.* versions still acceptable, since I have experience with these, or should I take the plunge to 1.0+? I realize that I won't need more than 8GB of RAM because I can't make the Java heap too big. Is it still worth paying money for extra RAM? Is the cache located outside of the heap in recent versions? Thanks to all of you for the advice I'm receiving on this board. Best regards, Maxim
Re: Implications of length of column names
When I migrated data from our RDBMS, I hashed column names to integers. This makes for some footwork, but the space gain is clearly there, so it's worth it. I de-hash on read. Maxim

On 2/10/2012 5:15 PM, Narendra Sharma wrote: It is good to have short column names. They save space all the way from network transfer to in-memory usage to storage. It is also a good idea to club immutable columns that are read together and store them as a single column. We gained significant overall performance benefits with this. -Naren

On Fri, Feb 10, 2012 at 12:20 PM, Drew Kutcharian d...@venarc.com wrote: What are the implications of using short vs long column names? Is it better to use short column names or longer ones? I know for MongoDB you are better off using short field names: http://www.mongodb.org/display/DOCS/Optimizing+Storage+of+Small+Objects Does this apply to Cassandra column names? -- Drew

-- Narendra Sharma Software Engineer http://www.aeris.com http://www.persistentsys.com/ http://narendrasharma.blogspot.com/
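A toy sketch of what that footwork amounts to; the mapping below is purely illustrative, and in practice it must be stable and shared by every writer and reader:

    # Stable mapping between verbose column names and short integer aliases.
    NAME_TO_ID = {'vehicle_id': 1, 'position_datetime': 2, 'status': 3}
    ID_TO_NAME = {v: k for k, v in NAME_TO_ID.items()}

    def encode(columns):
        # Shrink the column names before writing.
        return {str(NAME_TO_ID[n]): v for n, v in columns.items()}

    def decode(columns):
        # De-hash on read: restore the original names.
        return {ID_TO_NAME[int(n)]: v for n, v in columns.items()}

    row = {'vehicle_id': 'v42', 'status': 'moving'}
    assert decode(encode(row)) == row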
Please advise -- 750MB object possible?
Hello everybody, I'm being asked whether we can serve an object, which I assume is a blob, of 750MB size. I guess the real question is how to chunk it, and whether it's even possible to chunk it. Thanks! Maxim
Re: Please advise -- 750MB object possible?
The idea was to provide redundancy, resilience, automatic load balancing and automatic repairs. Going the way of the file system does not achieve any of that. Maxim

On 2/22/2012 1:34 PM, Mohit Anchlia wrote: Outside, on the file system, with a pointer to it in C*.

On Wed, Feb 22, 2012 at 10:03 AM, Rafael Almeida almeida...@yahoo.com wrote: Keep them where?

From: Mohit Anchlia mohitanch...@gmail.com To: user@cassandra.apache.org Cc: potek...@bnl.gov Sent: Wednesday, February 22, 2012 3:44 PM Subject: Re: Please advise -- 750MB object possible? In my opinion, if you are a busy site or application, keep blobs out of the database.

On Wed, Feb 22, 2012 at 9:37 AM, Dan Retzlaff dretzl...@gmail.com wrote: Chunking is a good idea, but you'll have to do it yourself. A few of the columns in our application got quite large (maybe ~150MB) and the failure mode was RPC timeout exceptions. Nodes couldn't always move that much data across our data center interconnect in the default 10 seconds. With enough heap and a faster network you could probably get by without chunking, but it's not ideal.

On Wed, Feb 22, 2012 at 9:04 AM, Maxim Potekhin potek...@bnl.gov wrote: Hello everybody, I'm being asked whether we can serve an object, which I assume is a blob, of 750MB size. I guess the real question is how to chunk it, and whether it's even possible to chunk it. Thanks! Maxim
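For illustration, the do-it-yourself chunking Dan describes might look like this in pycassa; the CF name, chunk size and column-naming scheme are all arbitrary choices:

    import pycassa

    CHUNK = 4 * 1024 * 1024  # 4MB per column keeps each read comfortably small

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    blobs = pycassa.ColumnFamily(pool, 'blobs')  # hypothetical CF

    def put_blob(key, data):
        # One row per object: one column per chunk, plus a chunk count.
        for i in range(0, len(data), CHUNK):
            blobs.insert(key, {'chunk:%08d' % (i // CHUNK): data[i:i + CHUNK]})
        blobs.insert(key, {'nchunks': str((len(data) + CHUNK - 1) // CHUNK)})

    def get_blob(key):
        # Read the chunks one at a time so no single RPC moves 750MB.
        n = int(blobs.get(key, columns=['nchunks'])['nchunks'])
        name = lambda i: 'chunk:%08d' % i
        return b''.join(blobs.get(key, columns=[name(i)])[name(i)]
                        for i in range(n))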
Re: Please advise -- 750MB object possible?
Thank you so much, looks nice, I'll be looking into it.

On 2/22/2012 3:08 PM, Rob Coli wrote: On Wed, Feb 22, 2012 at 10:37 AM, Maxim Potekhin potek...@bnl.gov wrote: The idea was to provide redundancy, resilience, automatic load balancing and automatic repairs. Going the way of the file system does not achieve any of that.

(Apologies for continuing a slightly OT thread, but if people google and find this thread, I'd like it to contain the below relevant suggestion.. :D) With the caveat that you would have to ensure that your client code streams instead of buffering the entire object, you probably want something like MogileFS: http://danga.com/mogilefs/ I have operated a sizable MogileFS cluster for Digg, and it was one of the simplest, most comprehensible and least error-prone parts of our infrastructure. A++ would run again. -- =Robert Coli rc...@palominodb.com
Re: nodetool hangs and didn't print anything with firewall
That's good to hear, because it does present a problem for a strictly managed and firewalled campus environment. Maxim

On 2/6/2012 11:57 AM, Nick Bailey wrote: JMX is not very firewall friendly. The problem is that JMX is a two-connection process: the first connection happens on port 7199, and the second connection happens on some random port above 1024. Work on changing this behavior was started in this ticket: https://issues.apache.org/jira/browse/CASSANDRA-2967

On Mon, Feb 6, 2012 at 2:02 AM, R. Verlangen ro...@us2.nl wrote: Do you allow both outbound and inbound traffic? You might also try allowing both TCP and UDP.

2012/2/6 Roshan codeva...@gmail.com: Yes, if the firewall is disabled it works. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/nodetool-hangs-and-didn-t-print-anything-with-firewall-tp7257286p7257310.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Encrypting traffic between Hector client and Cassandra server
Hello, do you see any value in having a web service over Cassandra, with the actual clients talking to it via HTTPS/SSL? This way the cluster can be firewalled, and therefore protected, plus you get decent authentication/authorization right there. Maxim

On 1/31/2012 5:21 PM, Xaero S wrote: I have been trying to figure out how to secure/encrypt the traffic between the client (Hector) and the Cassandra server. I looked at this link: https://issues.apache.org/jira/browse/THRIFT-106 But since Thrift sits in a layer beneath Hector, I am wondering how I can get Hector to use the right Thrift calls to have the encryption happen. Also, where can I get instructions for any required setup for encrypting the traffic between the Hector client and the Cassandra server? Would appreciate any help in this regard. Below are the setup versions: Cassandra - 0.8.7, Hector - 0.8.0-2, libthrift jar - 0.6.1. On a side note, we have set up internode encryption on the Cassandra server side and found the documentation for that easily.
Re: Restart cassandra every X days?
Sorry if this has been covered -- I was concentrating solely on 0.8.x. Can I just download 1.0.x and continue using the same data on the same cluster? Maxim

On 1/28/2012 7:53 AM, R. Verlangen wrote: Ok, seems that it's clear what I should do next ;-)

2012/1/28 aaron morton aa...@thelastpickle.com: There are no blockers to upgrading to 1.0.X. A - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 28/01/2012, at 7:48 AM, R. Verlangen wrote: Ok. Seems that an upgrade might fix these problems. Is Cassandra 1.x.x stable enough to upgrade to, or should we wait for a couple of weeks?

2012/1/27 Edward Capriolo edlinuxg...@gmail.com: I would not say that issuing a restart after X days is a good idea. You are mostly developing a superstition. You should find the source of the problem. It could be JMX or Thrift clients not closing connections. We don't restart nodes on a regimen; they work fine.

On Thursday, January 26, 2012, Mike Panchenko m...@mihasya.com wrote: There are two relevant bugs (that I know of), both resolved in somewhat recent versions, which make somewhat regular restarts beneficial: https://issues.apache.org/jira/browse/CASSANDRA-2868 (memory leak in GCInspector, fixed in 0.7.9/0.8.5) and https://issues.apache.org/jira/browse/CASSANDRA-2252 (heap fragmentation due to the way memtables used to be allocated, refactored in 1.0.0). Restarting daily is probably too frequent for either one of those problems. We usually notice degraded performance in our ancient cluster after ~2 weeks w/o a restart. As Aaron mentioned, if you have plenty of disk space, there's no reason to worry about cruft sstables. The size of your active set is what matters, and you can determine if that's getting too big by watching for iowait (due to reads from the data partition) and/or paging activity of the java process. When you hit that problem, the solution is to 1. try to tune your caches and 2. add more nodes to spread the load. I'll reiterate: looking at raw disk space usage should not be your guide for that. Forcing a GC generally works, but should not be relied upon (note the word "suggests" in http://docs.oracle.com/javase/6/docs/api/java/lang/System.html#gc()). It's great news that 1.0 uses a better mechanism for releasing unused sstables. nodetool compact triggers a major compaction and is no longer recommended by DataStax (details at the bottom of http://www.datastax.com/docs/1.0/operations/tuning#tuning-compaction). Hope this helps. Mike.

On Wed, Jan 25, 2012 at 5:14 PM, aaron morton aa...@thelastpickle.com wrote: That disk usage pattern is to be expected in pre-1.0 versions. Disk usage is far less interesting than disk free space: if it's using 60GB and there is 200GB, that's OK; if it's using 60GB and there is 6MB free, that's a problem. In pre-1.0, compacted files are deleted on disk by waiting for the JVM to decide to GC all remaining references. If there is not enough space on disk (to store the total size of the files it is about to write or compact), GC is forced and the files are deleted; otherwise they will get deleted at some point in the future. In 1.0, files are reference counted and space is freed much sooner. With regard to regular maintenance, nodetool cleanup removes data from a node that it is no longer a replica for. This is only of use when you have done a token move. I would not recommend a daily restart of the cassandra process. You will lose all the run-time optimizations the JVM has made (I think the mapped file pages will stay resident), as well as adding additional entropy to the system, which must be repaired via HH, RR or nodetool repair. If you want to see compacted files purged faster, the best approach would be to upgrade to 1.0. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 26/01/2012, at 9:51 AM, R. Verlangen wrote: In his message he explains that it's for "Forcing a GC". GC stands for garbage collection. For some more background see: http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
Problematic deletes in 0.8.8
Hello, after I thought I was out of the woods with data deletion in 0.8.8, I unfortunately see undead data and other strange behavior. Let me clarify:

a) I do run repair and compaction well within GC_GRACE.
b) Deletes happen daily.
c) After a few repairs, when I run an indexed query on the data that I tried to delete, it takes a while, even when the result is 0 rows. This I don't quite understand -- I thought that the index itself should be void of keys that were deleted. What takes so long?
d) Finally, even when doing repairs after deletes, I do see data that is not supposed to be there.

Any ideas? Maxim
Re: Restart cassandra every X days?
I also do repair, compact and cleanup every couple of days, and also have daily restarts on crontab. It doesn't hurt, and I avoid having a node become unresponsive after many days of operation, which has happened before. Older files get cleaned up on restart. It doesn't take long to shut down and restart a node, so if there is enough replication in the cluster it isn't an issue. Maxim

On 1/25/2012 1:13 PM, Karl Hiramoto wrote: On 01/25/12 16:09, R. Verlangen wrote: Hi there, I'm currently running a 2-node cluster for some small projects that might need to scale up in the future; that's why we chose Cassandra. The actual problem is that one of the nodes' hard-drive usage keeps growing. For example: after a fresh restart, ~10GB; after a couple of days running, ~60GB. I know that Cassandra uses lots of disk space, but is this still normal? I'm running Cassandra 0.8.7.

I run 9 nodes with Cassandra 0.7.8 and we see this same behaviour, but we keep it under control by running the sequence: nodetool repair, nodetool compact, nodetool cleanup. According to the 1.0.x changelog, IIRC, this disk usage is supposed to be improved. -- Karl
Re: Cassandra x MySQL Sharded - Insert Comparison
a) I hate to break it to you, but 6GB x 4 cores != 'high-end machine'. It's pretty much middle-of-the-road consumer level these days.
b) Hosting the client and Cassandra on the same node is a Bad Idea. It will depend on what exactly the client will do, but in my experience it won't work too well in general.
c) Have you considered dual boot, so you can have a good operating system (as per the Cassandra folks) in addition to Windows?

Maxim

On 1/22/2012 8:22 PM, Gustavo Gustavo wrote: Ok guys, thank you for the valuable hints you gave me. For sure, things will perform much better on real hardware. But my objective maybe isn't really to see what's the max throughput that the datastores have; it is more like: given equal conditions, which one would perform better? But I'll do it this way: I'm going to use a high-end machine (6GB RAM, 4 cores) and run Cassandra, MySQL and the client test application on the same machine. Unfortunately, I'll have to use Windows 7 as a host to the datastores. From your experience, do you think that, even on a single node, Cassandra can beat an RDBMS at inserts? I've seen that InnoDB (something that compares to the other databases' relational engines) is pretty slow, but when it comes to MyISAM, things are much faster. /Gustavo

2012/1/22 Chris Gerken chrisger...@mindspring.com: Edward (and Maxim), I agree. I was just recalling previous performance bake-offs (for other technologies, long time ago, galaxy far far away) in which the customer had put together a mockup of the high throughput expected in production and wanted to make a decision against that one set of numbers. We always found that both/all competing products could be made to run faster due to unexpected factors in the non-production test build. For our side, we always started simple and built up the throughput until we found a bottleneck. We fixed the bottleneck. Rinse and repeat. Chris Gerken chrisger...@mindspring.com 512.587.5261 http://www.linkedin.com/in/chgerken

On Jan 22, 2012, at 8:51 AM, Edward Capriolo wrote: In some sense one-for-one performance almost does not matter, though I bet you can get Cassandra better (I remember the old-school YCSB white paper benches against a sharded MySQL). One of the main bullet points of Cassandra is that if you want to grow from 4 nodes, to 8 nodes, to 14 nodes, and so on, Cassandra is elastic and supports online adding and removing of nodes. A do-it-yourself "hash mod N" algorithm really has no upgrade path. Edward

On Sun, Jan 22, 2012 at 9:26 AM, Chris Gerken chrisger...@mindspring.com wrote: Howdy Gustavo, One thing that jumped out at me is your having put two Cassandra images on the same box. There may be enough CPU and memory for the two images combined, but you may be seeing some other resource not being shared so nicely -- network card bandwidth, for example. More generally, the real question is what the bottleneck is (for both DBs, actually). Start with Cassandra running in that configuration and start with one client thread sending one request a second. Look at the CPU, network and memory metrics for all boxes (including the client). Nothing should be even close to maxing out at that throughput. Now incrementally increase one of the test parameters (number of clients or number of inserts per second) just a bit (say from one transaction to 5) and note the above metrics. Keep slowly increasing the test parameters, one at a time, until one of the metrics maxes out. That's the bottleneck you're wondering about. Fix that, and the DB (be it Cassandra or MySQL) will move ahead of the other performance-wise. Turn your attention to the other DB and repeat. - Chris Gerken

On Jan 22, 2012, at 7:10 AM, Gustavo Gustavo wrote: Hello, I've set up a testing environment for Cassandra and MySQL, to compare both, regarding *performance only*. And I must admit that I was expecting Cassandra to beat MySQL, but I've not seen this happening up to now. My application/use case is INSERT intensive, since I'm not updating anything, just inserting all the time. To compare both I created virtual machines with Ubuntu 11.10, and installed the latest versions of each datastore. Each VM has 1GB of RAM. I've used VMs as a way to give both datastores an equal sandbox. MySQL is set up to work as sharded, with 2 databases; that means that records are inserted to a specific instance based on key % 2. The engine is MyISAM (InnoDB was really slow and not
Re: Cassandra usage
You provide zero information on what you are planning to do with the data. Thus, your question is impossible to answer.

On 1/24/2012 9:38 PM, francesco.tangari@gmail.com wrote: Do you think that for a standard project with 50,000,000 rows on 2-3 machines Cassandra is appropriate, or should I use a normal DBMS? -- francesco.tangari@gmail.com
Re: Cassandra x MySQL Sharded - Insert Comparison
Hello, I have some experience in benchmarking Cassandra against Oracle, and in running on a VM cluster. While the VM solution will work for many applications, it simply won't cut it for all. In particular, I observed a large difference in insert performance when I moved from VMs to real hardware. Why this is the case can be due to a bazillion factors, including the high core count on my real machines and vastly better I/O. The CPU is crucial for inserts in Cassandra, and it may not be for an RDBMS. Another factor is the potential bottleneck in the client: there are cases when you won't have enough muscle to handle the data in the client itself. None of this is definitive; I'm just throwing in a bit of my experience from the past 12 months. Right now I'm able to sink data at insane speeds, far beyond those of Oracle. Maxim

On 1/22/2012 8:10 AM, Gustavo Gustavo wrote: Hello, I've set up a testing environment for Cassandra and MySQL, to compare both, regarding *performance only*. And I must admit that I was expecting Cassandra to beat MySQL, but I've not seen this happening up to now. My application/use case is INSERT intensive, since I'm not updating anything, just inserting all the time. To compare both I created virtual machines with Ubuntu 11.10, and installed the latest versions of each datastore. Each VM has 1GB of RAM. I've used VMs as a way to give both datastores an equal sandbox. MySQL is set up to work as sharded, with 2 databases; that means that records are inserted to a specific instance based on key % 2. The engine is MyISAM (InnoDB was really slow and not really needed for my case). There's a primary compound key (integer and datetime columns) in this test table. Let's name the nodes MySQL1 and MySQL2. Cassandra is set up to work with 4 nodes, with keys (tokens) set up to distribute records evenly across the 4 nodes (nodetool ring reports 25% to each node), replication factor 1 and RandomPartitioner; the other configs are left at defaults. Let's name the nodes Cassandra1, Cassandra2, Cassandra3 and Cassandra4. I'm using 2 physical machines (Windows 7) to host the 4 (Cassandra) or 2 (MySQL) virtual machines, this way: Machine1: MySQL1, Cassandra1, Cassandra3; Machine2: MySQL2, Cassandra2, Cassandra4. The machines have enough CPU and RAM to host the Cassandra cluster or the MySQL cluster at a time. The client test application is running on a third physical machine, with 8 threads doing inserts. The test application is written in C# (Windows 7) using the Aquiles high-level client. My use case is a vehicle tracking system. So, let's suppose, from minute to minute, the vehicle sends its position together with some other GPS data and vehicle status information. The columns in my Cassandra cluster are just the DateTime (long value) of a position for a specific vehicle, and the value is all the other data serialized to binary format. Therefore, my CF really grows in the number of columns. All data is inserted into only one CF/table named Positions. The key for Cassandra is the VehicleID, and for MySQL VehicleID + PositionDateTime (MySQL creates an index for this automatically). It is important to note that MySQL threw tons of connection exceptions, even though the inserts were retried until they got through. My test case was to insert 1k positions for 1k vehicles over 10 days, which gives 10,000,000 inserts.

The final throughput that my application had for this scenario was:

Cassandra x 4
2012-01-21 11:45:38,044 #6 [Logger.Log] INFO - Inserted 1 positions for 1000 vehicles (1000 inserts):
2012-01-21 11:45:38,082 #6 [Logger.Log] INFO - Total Time: 2:37:03,359
2012-01-21 11:45:38,085 #6 [Logger.Log] INFO - Throughput: 1061 inserts/s

And for MySQL x 2
2012-01-21 14:26:25,197 #6 [Logger.Log] INFO - Inserted 1 positions for 1000 vehicles (1000 inserts):
2012-01-21 14:26:25,250 #6 [Logger.Log] INFO - Total Time: 2:06:25,914
2012-01-21 14:26:25,263 #6 [Logger.Log] INFO - Throughput: 1318 inserts/s

Is there something that I'm missing here? Is this expected? Or is the problem somewhere else, which is hard to say from this description? Cheers, Gustavo
Re: delay in data deleting in Cassandra
Did you run repairs within GC_GRACE all the time?

On 1/20/2012 3:42 AM, Shammi Jayasinghe wrote: Hi, I am experiencing a delay in delete operations in Cassandra. It's as follows: I am running a thread which contains the following three steps.

Step 01: Read data from column family foo [1].
Step 02: Process the received data, e.g. bar1, bar2, bar3, bar4, bar5.
Step 03: Remove the processed data from foo [2].

The problem occurs when this thread is invoked for the second time: in that step, it returns some of the data that I already deleted in the third step of the previous cycle, e.g. it returns bar2, bar3, bar4, bar5. It seems that though I called the remove operation as in [2], it takes time for the deletion to propagate to the file system. If I make the thread sleep for 5 seconds between the thread cycles, it does not give me any data that I deleted in the third step.

[1]
SliceQuery<String, String, byte[]> sliceQuery = HFactory.createSliceQuery(keyspace, stringSerializer, stringSerializer, bs);
sliceQuery.setKey(queueName);
sliceQuery.setRange("", "", false, messageCount);
sliceQuery.setColumnFamily(USER_QUEUES_COLUMN_FAMILY);

[2]
Mutator<String> mutator = HFactory.createMutator(keyspace, stringSerializer);
mutator.addDeletion(queueName, USER_QUEUES_COLUMN_FAMILY, messageId, stringSerializer);
mutator.execute();

Is there a solution for this? Cassandra version: 0.8.0, libthrift version: 0.6.1. Thanks, Shammi -- Best Regards, Shammi Jayasinghe, Senior Software Engineer, WSO2, Inc.; http://wso2.com, mobile: +94 71 4493085
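One cause worth ruling out (an assumption on my part, not something stated in the thread) is reading and deleting at consistency level ONE, where a read issued right after a delete can hit a replica the tombstone has not reached yet; QUORUM reads combined with QUORUM writes give read-your-writes behavior. For comparison with the Hector code above, the equivalent setting in pycassa looks like this:

    import pycassa
    from pycassa.cassandra.ttypes import ConsistencyLevel

    pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
    cf = pycassa.ColumnFamily(
        pool, 'user_queues',  # hypothetical CF name
        read_consistency_level=ConsistencyLevel.QUORUM,
        write_consistency_level=ConsistencyLevel.QUORUM)

    cf.remove('queue-1', columns=['message-123'])  # tombstone written at QUORUM
    # A QUORUM read issued after this remove will not return message-123.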
Re: Cassandra to Oracle?
What makes you think that an RDBMS will give you acceptable performance? I guess you will try to index it to death (because otherwise the ad-hoc queries won't work well, if at all), and at that point you may be hit with a performance penalty. It may be a good idea to interview the users and build denormalized views in Cassandra, maybe on a separate look-up cluster. A few percent of users will be unhappy, but you'll find it hard to do better. I'm talking from my experience with an industrial-strength RDBMS, which doesn't scale very well for what you call ad-hoc queries. Regards, Maxim

On 1/20/2012 9:28 AM, Brian O'Neill wrote: I can't remember if I asked this question before, but we're using Cassandra as our transactional system, and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (> 100 jobs now). But we are still struggling to provide an ad-hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS. Anyone have any ideas? Better ways to support ad-hoc queries? Effectively, our users want to be able to select count(distinct Y) from X group by Z, where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time. Are people doing bulk exports? Anyone trying to keep an RDBMS in sync in real time? -brian -- Brian O'Neill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: Cassandra to Oracle?
I certainly agree with "difficult to predict". There is a Danish proverb which goes: "It's difficult to make predictions, especially about the future." My point was that it's equally difficult with NoSQL and an RDBMS. The latter requires indexing to operate well, and that's a potential performance problem.

On 1/20/2012 7:55 PM, Mohit Anchlia wrote: I think the problem stems from having data in a column that you need to run an ad-hoc query on, which is not denormalized. In most cases it's difficult to predict the type of query that would be required. Another way of solving this could be to index the fields in a search engine.

On Fri, Jan 20, 2012 at 7:37 PM, Maxim Potekhin potek...@bnl.gov wrote: What makes you think that an RDBMS will give you acceptable performance? I guess you will try to index it to death (because otherwise the ad-hoc queries won't work well, if at all), and at that point you may be hit with a performance penalty. It may be a good idea to interview the users and build denormalized views in Cassandra, maybe on a separate look-up cluster. A few percent of users will be unhappy, but you'll find it hard to do better. I'm talking from my experience with an industrial-strength RDBMS, which doesn't scale very well for what you call ad-hoc queries. Regards, Maxim

On 1/20/2012 9:28 AM, Brian O'Neill wrote: I can't remember if I asked this question before, but we're using Cassandra as our transactional system, and building up quite a library of map/reduce jobs that perform data quality analysis, statistics, etc. (> 100 jobs now). But we are still struggling to provide an ad-hoc query mechanism for our users. To fill that gap, I believe we still need to materialize our data in an RDBMS. Anyone have any ideas? Better ways to support ad-hoc queries? Effectively, our users want to be able to select count(distinct Y) from X group by Z, where Y and Z are arbitrary columns of rows in X. We believe we can create column families with different key structures (using Y and Z as row keys), but some column names we don't know / can't predict ahead of time. Are people doing bulk exports? Anyone trying to keep an RDBMS in sync in real time? -brian -- Brian O'Neill Lead Architect, Health Market Science (http://healthmarketscience.com) mobile: 215.588.6024 blog: http://weblogs.java.net/blog/boneill42/ blog: http://brianoneill.blogspot.com/
Re: ideal cluster size
You can also scale not horizontally but diagonally, i.e. RAID the SSDs and use multicore CPUs. This means you'll get the same performance with fewer nodes, making the cluster far easier to manage. SSDs by themselves will give you an order-of-magnitude improvement in I/O.

On 1/19/2012 9:17 PM, Thorsten von Eicken wrote: We're embarking on a project where we estimate we will need on the order of 100 Cassandra nodes. The data set is perfectly partitionable, meaning we have no queries that need access to all the data at once. We expect to run with RF=2 or =3. Is there some notion of an ideal cluster size? Or, perhaps asked differently, would it be easier to run one large cluster or a bunch of, say, 16-node clusters? Everything we've done to date has fit into 4-5 node clusters.
Re: Using 5-6 bytes for cassandra timestamps vs 8…
I must have accidentally deleted all messages in this thread save this one. On the face of it, we are talking about saving two or three bytes per column. I know it can add up with many columns, but relative to the size of the column -- is it THAT significant? I made an effort to minimize my CF footprint by replacing the natural column keys with integers (and translating back and forth when writing and reading). It's easy to see that in my case I achieve storage savings of at least 30%, and almost 50% in the best case. But if the column in question contains more than 20 bytes -- what's the point of trying to save 2? Cheers, Maxim

On 1/18/2012 11:49 PM, Ertio Lew wrote: I believe the timestamps, on a *per-column* basis, are only required until compaction time; after that, it might also work if the timestamp range could be specified globally, on a per-SSTable basis. Thus, until compaction, the timestamps are only required to measure the time from the initialization of the new memtable to the point the column is written to that memtable, and you could easily fit that time in 4 bytes. This, I believe, would save at least 4 bytes of overhead for each column. Is anything related to these overheads under consideration, or planned in the roadmap?

On Tue, Sep 6, 2011 at 11:44 AM, Oleg Anastasyev olega...@gmail.com wrote: "I have a patch for trunk which I just have to get time to test a bit before I submit. It is for super columns and will use the super column's timestamp as the base and only store variant-encoded offsets in the underlying columns." Could you please measure how much real benefit it brings (in real RAM consumption by the JVM)? It is hard to tell whether it will give noticeable results or not. AFAIK the memory structures used for the memtable consume much more memory, and a 64-bit JVM allocates memory aligned to 64-bit word boundaries, so a 37% reduction in memory consumption looks doubtful.
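To put a number on that, here is a back-of-the-envelope estimate; the layout below is an assumption about the 0.8-era on-disk column format (2-byte name length, name, 1-byte flags, 8-byte timestamp, 4-byte value length, value), not something stated in the thread:

    # Rough per-column size under the assumed serialization layout.
    def column_bytes(name_len, value_len, ts_bytes=8):
        return 2 + name_len + 1 + ts_bytes + 4 + value_len

    full = column_bytes(name_len=10, value_len=20)             # 45 bytes
    trimmed = column_bytes(name_len=10, value_len=20, ts_bytes=6)
    print('%.1f%% saved' % (100.0 * (full - trimmed) / full))  # ~4.4%

By contrast, shrinking a 10-byte name to a 1-byte alias in the same formula saves about 20%, which is in line with the 30-50% figure above for wider original names.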
Re: About initial token, autobootstraping and load balance
I see. Sure, that's a bit more complicated, and you'd have to move tokens after adding a machine. Maxim On 1/15/2012 4:40 AM, Vitalii Tymchyshyn wrote: There's nothing wrong for 3 nodes. It's a problem for a cluster of 20+ nodes, growing. 2012/1/14 Maxim Potekhin potek...@bnl.gov I'm just wondering -- what's wrong with manual specification of tokens? I'm so glad I did it and have not had problems with balancing at all. Before that I was indeed stuck with a 25/25/50 setup in a 3-machine cluster, and had to move tokens to make it 33/33/33; I screwed up a little in that the first one did not start with 0, which is not a good idea. Maxim -- Best regards, Vitalii Tymchyshyn
Re: About initial token, autobootstraping and load balance
I'm just wondering -- what's wrong with manual specification of tokens? I'm so glad I did it and have not had problems with balancing at all. Before that I was indeed stuck with a 25/25/50 setup in a 3-machine cluster, and had to move tokens to make it 33/33/33; I screwed up a little in that the first one did not start with 0, which is not a good idea. Maxim On 1/13/2012 2:10 PM, David McNelis wrote: The documentation for that section needs to be updated... What happens is that if you just autobootstrap without setting a token, it will by default bisect the range of the largest node. So if you go through several iterations of adding nodes, this is what you would see: Gen 1: Node A: 100% of tokens, token range 1-10 (for example) Gen 2: Node A: 50% of tokens (1-5) Node B: 50% of tokens (6-10) Gen 3: Node A: 25% of tokens (1-2.5) Node B: 50% of tokens (6-10) Node C: 25% of tokens (2.6-5) In reality, what you'd want in gen 3 is every node at 33%, but that would not be the case without setting the tokens to begin with. You'll notice that there are a couple of scripts available to generate a list of initial tokens for your particular cluster size; then every time you add a node you'll need to update all the nodes with new tokens in order to properly load balance. Does this make sense? Other folks, am I explaining this correctly? David 2012/1/13 Carlos Pérez Miguel cperez...@gmail.com Hello, I have a doubt about how the initial token is determined. In Cassandra's documentation it is said that it is better to manually configure the initial token for each node in the system, but it is also said that if the initial token is not defined and autobootstrap is true, new nodes choose an initial token so as to improve the load balance of the cluster. But what happens if no initial token is chosen and autobootstrap is not activated? How does each node select its initial token to balance the ring? I ask this because I am making tests with a 20-node cassandra cluster with cassandra 0.7.9. No node has an initial token, nor is autobootstrap enabled. I restart the cluster with each test I want to make, and in the end the cluster is always well balanced. Thanks Carlos Pérez Miguel
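David mentions scripts that generate initial tokens; a minimal sketch of the calculation they perform, assuming the RandomPartitioner (md5-based, ring size 2**127), looks like this:

# Evenly spaced initial tokens for an N-node cluster under
# RandomPartitioner; set each node's initial_token accordingly,
# and recompute (and move tokens) whenever the cluster grows.
def initial_tokens(node_count):
    return [i * (2 ** 127) // node_count for i in range(node_count)]

for i, token in enumerate(initial_tokens(3)):
    print('node %d: initial_token = %d' % (i, token))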
Exception thrown during repair, contains jmx classes -- why?
As per the trace below, there is jmx.mbeanserver involved. What I ran was a common repair. Is that right? What does this failure indicate? at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: Should I throttle deletes?
Thanks, this makes sense. I'll try that. Maxim On 1/6/2012 10:51 AM, Vitalii Tymchyshyn wrote: Do you mean on writes? Yes, your timeouts must be such that your write batch can complete before the timeout elapses. But this will lower the write load, so reads should not time out. Best regards, Vitalii Tymchyshyn On 06.01.12 17:37, Philippe wrote: But you will then get timeouts. On Jan 6, 2012 15:17, Vitalii Tymchyshyn tiv...@gmail.com wrote: On 05.01.12 22:29, Philippe wrote: Then I do have a question: what do people generally use as the batch size? I used to do batches from 500 to 2000, like you do. After investigating issues such as the one you've encountered, I've moved to batches of 20 for writes and 256 for reads. Everything is a lot smoother: no more timeouts. I'd rather reduce the mutation thread pool with the concurrent_writes setting. This will lower server load no matter how many clients are sending batches, while you still get good batching. Best regards, Vitalii Tymchyshyn
Re: How does Cassandra decide when to do a minor compaction?
Hello Alexandru, I just want to have a feel for what activity to expect on the cluster. The load from minor compactions is not overwhelming, but it seems non-negligible. Maxim On 1/7/2012 5:12 AM, Alexandru Sicoe wrote: Hi Maxim, Why do you need to know this? Cheers, Alex On Sat, Jan 7, 2012 at 10:03 AM, aaron morton aa...@thelastpickle.com wrote: http://www.datastax.com/docs/1.0/operations/tuning#tuning-compaction - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 7/01/2012, at 3:17 PM, Maxim Potekhin wrote: The subject says it all -- pointers appreciated. Thanks Maxim
How to find out when a nodetool operation has ended?
Suppose I start a repair on one or a few nodes in my cluster from an interactive machine in the office, and leave for the day (which is a very realistic scenario, imho). Is there a way to know, from a remote machine, when a particular action, such as a compaction or repair, has finished? I figured that compaction stats can be mum at times, so they're not a reliable indicator. Many thanks, Maxim
Re: How to find out when a nodetool operation has ended?
Thanks, so I take it there is no solution outside of OpsCenter. I mean, of course I can redirect the output, with additional timestamps if needed, to a log file -- which I can access remotely. I just thought there might be some status command to tell me what maintenance the node is doing. Too bad there is not! Maxim On 1/6/2012 5:40 PM, R. Verlangen wrote: You might consider: - installing DataStax OpsCenter ( http://www.datastax.com/products/opscenter ) - starting the repair in a linux screen (so you can attach to the screen from another location)
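Lacking a status command, a wrapper along these lines records start and end times in a file you can check remotely -- a minimal sketch, with a hypothetical log path:

import datetime
import subprocess

LOG = '/var/log/cassandra/repair-wrapper.log'  # hypothetical path

def log(msg):
    with open(LOG, 'a') as f:
        f.write('%s %s\n' % (datetime.datetime.now().isoformat(), msg))

log('nodetool repair starting')
rc = subprocess.call(['nodetool', '-h', 'localhost', 'repair'])
log('nodetool repair finished, exit code %d' % rc)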
How does Cassandra decide when to do a minor compaction?
The subject says it all -- pointers appreciated. Thanks Maxim
Re: Should I throttle deletes?
Hello Aaron, On 1/5/2012 4:25 AM, aaron morton wrote: I use a batch mutator in Pycassa to delete ~1M rows based on a longish list of keys I'm extracting from an auxiliary CF (with no problem of any sort). What is the size of the deletion batches? 2000 mutations. Now, it appears that such a head-on delete puts a temporary but large load on the cluster. I have SSDs and they go to 100% utilization, and the CPU spikes to significant loads. Does the load spike during the deletion or after it? During. Do any of the thread pools back up in nodetool tpstats during the load? Haven't checked, thank you for the lead. I can think of a few general issues you may want to avoid: * Each row in a batch mutation is handled by a task in a thread pool on the nodes. So if you send a batch to delete 1,000 rows, it will put 1,000 tasks in the Mutation stage. This will reduce the query throughput. Aah, I didn't know that. I was under the impression that batching saves the communication overhead, and that's it. Then I do have a question: what do people generally use as the batch size? Thanks Maxim
Re: Should I throttle deletes?
Thanks, that's quite helpful. I'm wondering, though, if multiplying the number of clients will end up doing the same thing. On 1/5/2012 3:29 PM, Philippe wrote: Then I do have a question: what do people generally use as the batch size? I used to do batches from 500 to 2000, like you do. After investigating issues such as the one you've encountered, I've moved to batches of 20 for writes and 256 for reads. Everything is a lot smoother: no more timeouts. The downside, though, is that I have to run more client threads in parallel to maximize throughput. Cheers
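For reference, a minimal Pycassa sketch of deleting with a small batch size (keyspace, CF and keys are hypothetical); queue_size controls how many mutations go out per Thrift call:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'mycf')

keys_to_delete = ['key1', 'key2', 'key3']  # hypothetical keys

# Queue row deletions; the mutator flushes automatically every
# queue_size operations, and send() flushes whatever remains.
b = cf.batch(queue_size=20)
for key in keys_to_delete:
    b.remove(key)
b.send()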
Re: Strange OOM when doing list in CLI
Ed, thanks for a dose of common sense, I should have thunk about it. In fact, I only have 2 columns in that one particular CF, but one of these can get really fat (for a good reason). So the CLI just plain runs out of memory when pulling the default 100 rows (with a little help from various overheads). It didn't happen before because only the recent additions to the data are slightly fatter than the earlier ones. Thanks Maxim On 1/3/2012 10:27 PM, Edward Capriolo wrote: What you are probably running into is that list from the cli can bring all the columns of a key into memory. I have counters using composite keys, and about 1k columns causes this to happen. We should have some paging support with list. On Tuesday, January 3, 2012, Maxim Potekhin potek...@bnl.gov wrote: I came back from Xmas vacation only to see that what always was an innocuous procedure in CLI now reliably results in OOM -- does anyone have ideas why? It never happened before. Version of Cassandra is 0.8.8. 2956 java -ea -javaagent:/home/cassandra/cassandra/bin/../lib/jamm-0.2.2.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8000M -Xmx8000M -Xmn2000M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=199 [default@PANDA] list idxR; Using default limit of 100 Exception in thread main java.lang.OutOfMemoryError: Java heap space at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:140) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:752) at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:734) at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1379) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:266) at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217) at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
Should I throttle deletes?
Now that my cluster appears to run smoothly, and after a few successful repairs and compactions, I'm back in the business of deleting portions of data based on date of insertion. For reasons too lengthy to explain here, I don't want to use TTL. I use a batch mutator in Pycassa to delete ~1M rows based on a longish list of keys I'm extracting from an auxiliary CF (with no problem of any sort). Now, it appears that such a head-on delete puts a temporary but large load on the cluster. I have SSDs and they go to 100% utilization, and the CPU spikes to significant loads. Does anyone do throttling on such a mass-delete procedure? Thanks in advance, Maxim
Re: Cassandra WebUI with Sources released
Congrats on what seems to be a nice piece of work, need to check it out. Nicely complements other tools. Maxim On 1/2/2012 12:48 PM, Markus Wiesenbacher | Codefreun.de wrote: Hi, I wish you all a happy and healthy new year! As you may remember, I coded a little GUI for Apache Cassandra. Now I did set up a little project homepage where you can download it, including the sources: http://www.codefreun.de http://www.codefreunde.com Markus ;)
Strange OOM when doing list in CLI
I came back from Xmas vacation only to see that what always was an innocuous procedure in CLI now reliably results in OOM -- does anyone have ideas why? It never happened before. Version of Cassandra is 0.8.8. 2956 java -ea -javaagent:/home/cassandra/cassandra/bin/../lib/jamm-0.2.2.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8000M -Xmx8000M -Xmn2000M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=199 [default@PANDA] list idxR; Using default limit of 100 Exception in thread main java.lang.OutOfMemoryError: Java heap space at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:140) at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101) at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84) at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378) at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297) at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:752) at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:734) at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1379) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:266) at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217) at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
Re: Doubts related to composite type column names/values
With regard to the static composite, what are its major benefits compared with string concatenation (with some convenient separator inserted)? Thanks Maxim On 12/20/2011 1:39 PM, Richard Low wrote: On Tue, Dec 20, 2011 at 5:28 PM, Ertio Lew ertio...@gmail.com wrote: With regard to the composite columns stuff in Cassandra, I have the following doubts: 1. What is the storage overhead of the composite type column names/values? The values are the same. For each dimension, there are 3 bytes of overhead. 2. What exactly is the difference between the DynamicComposite and the static Composite? The static composite type has the types of each dimension specified in the column family definition, so all names within that column family have the same type. The dynamic composite type lets you specify the type for each column, so they can be different. There is extra storage overhead for this, and care must be taken to ensure all column names remain comparable.
Re: Doubts related to composite type column names/values
Thank you Aaron! As long as I have plain strings, would you say that I would do almost as well with concatenation? Of course I realize that mixed types are a very different case, where the composite is very useful. Thanks Maxim On 12/20/2011 2:44 PM, aaron morton wrote: Component values are compared in a type-aware fashion: an Integer is an Integer, not a 10-character zero-padded string. You can also slice on the components, just like with string concat, but nicer. E.g., if your app is storing comments for a thing, and the column names have the form (comment_id, field), i.e. (Integer, String), you can slice for all properties of a comment, or all properties for comments between two comment_ids. Finally, the client library knows what's going on. Hope that helps. - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 21/12/2011, at 7:43 AM, Maxim Potekhin wrote: With regard to the static composite, what are its major benefits compared with string concatenation (with some convenient separator inserted)? Thanks Maxim On 12/20/2011 1:39 PM, Richard Low wrote: On Tue, Dec 20, 2011 at 5:28 PM, Ertio Lew ertio...@gmail.com wrote: With regard to the composite columns stuff in Cassandra, I have the following doubts: 1. What is the storage overhead of the composite type column names/values? The values are the same. For each dimension, there are 3 bytes of overhead. 2. What exactly is the difference between the DynamicComposite and the static Composite? The static composite type has the types of each dimension specified in the column family definition, so all names within that column family have the same type. The dynamic composite type lets you specify the type for each column, so they can be different. There is extra storage overhead for this, and care must be taken to ensure all column names remain comparable.
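To illustrate the slicing Aaron describes, here is a minimal Pycassa sketch (pool, CF and key names are hypothetical), assuming a comparator like CompositeType(UTF8Type, UTF8Type); Pycassa accepts partial composite tuples as slice bounds:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'composites')

# Fetch every column whose first component is 'xyz'.
cols = cf.get('key1', column_start=('xyz',), column_finish=('xyz',))
for (first, second), value in cols.items():
    print('%s %s -> %s' % (first, second, value))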
Can I slice on composite indexes?
Let's say I have rows with composite columns, like (key1, {('xyz', 'abc'): 'colval1'}, {('xyz', 'def'): 'colval2'}) (key2, {('ble', 'meh'): 'otherval'}) Is it possible to create a composite type index such that I can query on 'xyz' and get the first two columns? Thanks Maxim
Re: commit log size
Alexandru, Jeremiah -- what setting needs to be tweaked, and what's the recommended value? I observed similar behavior this morning. Maxim On 11/28/2011 2:53 PM, Jeremiah Jordan wrote: Yes, the low-volume memtables are causing the problem. Lower the thresholds for those tables if you don't want the commit logs to go crazy. -Jeremiah On 11/28/2011 11:11 AM, Alexandru Dan Sicoe wrote: Hello everyone, 4-node Cassandra 0.8.5 cluster with RF=2, replica placement strategy = SimpleStrategy, write consistency level = ANY, memtable_flush_after_mins=1440; memtable_operations_in_millions=0.1; memtable_throughput_in_mb=40; max_compaction_threshold=32; min_compaction_threshold=4; I have one keyspace with 1 CF for all the data and 3 other small CFs for metadata. I am using the Datastax OpsCenter to monitor my cluster, so there is another keyspace for monitoring. Everything works ok; the only thing I've noticed is that this morning the commitlog of one node was 52GB, one was 25GB and the others were around 3GB. I left everything untouched and looked a couple of hours later: the 52GB one is now about 3GB, the 25GB one is now 29GB, and the other two are about the same as before. Are my commit logs growing because of small memtables which don't get flushed because they never reach the operations and throughput limits? Then why do only some nodes exhibit this behaviour? It would be interesting to understand how to control the size of the commitlog, and also to know how to size my commitlog disks! Thanks, Alex
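For what it's worth, these are per-CF thresholds in 0.8, so they can be lowered just for the small metadata CFs; a minimal Pycassa sketch (keyspace and CF names are hypothetical, and the attribute names should be verified against your client version):

from pycassa.system_manager import SystemManager

sys_mgr = SystemManager('localhost:9160')
# Flush small memtables more aggressively so their commit log
# segments can be recycled sooner.
sys_mgr.alter_column_family('MyKeyspace', 'metadata_cf',
                            memtable_throughput_in_mb=16,
                            memtable_operations_in_millions=0.02,
                            memtable_flush_after_mins=60)
sys_mgr.close()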
Re: Keys for deleted rows visible in CLI
Thanks, it makes perfect sense now. Well, an option in Cassandra could make hiding them optional as far as display is concerned, without a performance hit -- but of course this is all unimportant. Thanks again Maxim On 12/14/2011 11:30 AM, Brandon Williams wrote: http://wiki.apache.org/cassandra/FAQ#range_ghosts On Wed, Dec 14, 2011 at 4:36 AM, Radim Kolar h...@sendmail.cz wrote: On 14.12.2011 1:15, Maxim Potekhin wrote: Thanks. It could be hidden from a human operator, I suppose :) I agree. Open a JIRA for it.
Asymmetric load
What could be the reason I see unequal loads on a 3-node cluster? This all started happening during repairs (which again are not going smoothly). Maxim
Crazy compactionstats
Hello I ran repair like this: nohup repair.sh where repair.sh contains simply nodetool repair plus timestamp. The process dies while dumping this: Exception in thread main java.io.IOException: Repair command #1: some repair session(s) failed (see log for details). at org.apache.cassandra.service.StorageService.forceTableRepair(StorageService.java:1613) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208) at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427) at javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360) at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) at sun.rmi.transport.Transport$1.run(Transport.java:159) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Transport.java:155) at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) I still see pending tasks in nodetool compactionstats, and their number goes into hundreds which I haven't seen before. What's going on? Thanks Maxim
Best way to implement indexing for high-cardinality values?
I now have a CF with extremely skinny rows (in the current implementation), and the application will want to query by more than one column's value. The problem is that the values will in many cases be high-cardinality. One other factor is that I want to rotate data into and out of the system in one-day buckets -- LILO, in effect. The date will be one of the columns as well. I had 9 indexes in mind, but I think I can pare that down to 5. At least one of the columns I will need to query by has values that are guaranteed to be unique -- it is effectively one of two ways to identify the data, used by very different parts of the complete system. Indexing on that would be bad, wouldn't it? Any advice would be appreciated. Thanks Maxim
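For the guaranteed-unique column, a common alternative to a secondary index is a manual look-up CF; a minimal sketch, with hypothetical names:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
data = pycassa.ColumnFamily(pool, 'data')
by_uid = pycassa.ColumnFamily(pool, 'data_by_uid')

def write(primary_key, unique_id, columns):
    data.insert(primary_key, columns)
    # The unique value becomes a row key pointing back at the data row.
    by_uid.insert(unique_id, {'key': primary_key})

def lookup(unique_id):
    # Two point reads instead of an index scan over unique values.
    primary_key = by_uid.get(unique_id)['key']
    return data.get(primary_key)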
show schema bombs in 0.8.6
Running cli --debug: [default@PANDA] show schema; null java.lang.RuntimeException at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:310) at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217) at org.apache.cassandra.cli.CliMain.main(CliMain.java:345) Caused by: java.lang.NullPointerException at org.apache.cassandra.cli.CliClient.showColumnMeta(CliClient.java:1716) at org.apache.cassandra.cli.CliClient.showColumnFamily(CliClient.java:1686) at org.apache.cassandra.cli.CliClient.showKeyspace(CliClient.java:1636) at org.apache.cassandra.cli.CliClient.executeShowSchema(CliClient.java:1598) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:250)
Keys for deleted rows visible in CLI
Hello, I searched the archives and it appears that this question was once asked but was not answered. I just deleted a lot of rows, and ran list in the CLI. I still see the keys. This is not the same as getting slices, is it? Anyhow, what's the reason and rationale? I run 0.8.8. Thanks Maxim
Re: Keys for deleted rows visible in CLI
Thanks. It could be hidden from a human operator, I suppose :) On 12/13/2011 7:12 PM, Harold Nguyen wrote: Hi Maxim, The reason for this is because if node 1 goes down while you deleted information on node 2, node 1 will know not to repair the data when it comes back again. It will know that an operation has been performed to delete the data. Harold -Original Message- From: Maxim Potekhin [mailto:potek...@bnl.gov] Sent: Tuesday, December 13, 2011 4:03 PM To: user@cassandra.apache.org Subject: Keys for deleted rows visible in CLI Hello, I searched the archives and it appears that this question was once asked but was not answered. I just deleted a lot of rows, and want to list in cli. I still see the keys. This is not the same as getting slices, is it? Anyhow, what's the reason and rationale? I run 0.8.8. Thanks Maxim
Deleted rows re-appearing on repair in 0.8.6
Hello, I know that this problem used to exist in 0.8.1 -- I delete rows, run a repair and these rows are back with a vengeance. I recall I was told that this was fixed in 0.8.6 -- is that the case? I still keep seeing that behavior. Thanks Maxim
Really old files in the data directory
Hello, I varied the GC grace a few times over my cluster's lifetime, but I never went above 10 days. I did compactions, repairs etc. Now I see that some files in the data directories of the nodes, there from day one, carry timestamps back from July. Among these are files containing secondary indexes. Since I have deleted a large portion of the data, one would expect these files to have been rebuilt once or many times. What's happening? I run 0.8.6. Thanks Maxim
Re: Cassandra 0.8.8
Hello everyone, so what's the update on 0.8.8? Many thanks Maxim On 12/2/2011 4:49 AM, Patrik Modesto wrote: Hi, It's been almost 2 months since the release of the 0.8.7 version and there are quite some changes in 0.8.8, so I'd like to ask is there a release date? Regards, Patrik
forceUserDefinedCompaction -- how to use it?
Can anyone provide an example of how to use forceUserDefinedCompaction? Thanks Maxim
Re: exporting data from Cassandra cluster
Hello Alexandru, as you probably know, my group is using Amazon S3 to permanently (or semi-permanently) park the data in CSV format, which makes it portable: we can load it into anything if needed, or analyze it on its own. Just my half of a Swiss centime :) And, because the S3 option is not for everybody, and since you are at CERN -- talk to the data people in ATLAS. 350GB seems trivial. Regards Maxim On 12/7/2011 11:17 AM, Alexandru Dan Sicoe wrote: Hello everyone. 3-node Cassandra 0.8.5 cluster. I've left the system running in a production environment for long-term testing. I've accumulated about 350GB of data with RF=2. The machines I used for the tests are older and need to be replaced. Because of this I need to export the data to a permanent location. How should I export the data? In order to reduce the storage space, I want to export only the non-replicated data; I mean, just one copy of the data (without the replicas). Is this possible? How? Cheers, Alexandru
Cassandra behavior too fragile?
OK, thanks to the excellent help of the Datastax folks, some of the more severe inconsistencies in my Cassandra cluster were fixed (after a node was down and compactions failed, etc). I'm still having problems, as reported in the Repair failure under 0.8.6 thread. Thing is, why is it so easy for the repair process to break? OK, I admit I'm not sure why nodes are reported as dead once in a while, but it's absolutely certain that they don't simply fall off the edge or get knocked out for 10 minutes or anything like that. Why is there no built-in tolerance/retry mechanism, so that a node that may seem silent for a minute can be contacted later, or, better yet, a different node with a relevant replica is contacted? As was evident from some presentations at Cassandra-NYC yesterday, failed compactions and repairs are a major problem for a number of users. The cluster can quickly become unusable. I think it would be a good idea to build more robustness into these procedures. Regards Maxim
Re: Repair failure under 0.8.6
Basically, I tweaked the phi, put in more verbose GC reporting, and decided to do a compaction before proceeding. I'm getting this on the node where the compaction is being run; the system logs for the other two nodes follow. It's obvious that the cluster is sick, but I can't determine why -- there is no overwhelming GC evidence as far as I can see. I didn't start a compaction on node #3; somehow it attempts to do it anyhow. === Node #2 (compaction is being run): INFO [CompactionExecutor:2] 2011-12-05 14:19:36,741 CompactionManager.java (line 608) Compacted to /data/cassandra_data/data/system/LocationInfo-tmp-g-72-Data.db. 967 to 561 (~58% of original) bytes for 4 keys. Time: 71ms. INFO [main] 2011-12-05 14:19:36,941 Mx4jTool.java (line 67) mx4j successfuly loaded INFO [GossipStage:1] 2011-12-05 14:19:36,943 Gossiper.java (line 715) Node /130.199.185.193 has restarted, now UP again INFO [GossipStage:1] 2011-12-05 14:19:36,943 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP INFO [GossipStage:1] 2011-12-05 14:19:36,971 StorageService.java (line 819) Node /130.199.185.193 state jump to normal INFO [GossipStage:1] 2011-12-05 14:19:36,971 Gossiper.java (line 715) Node /130.199.185.195 has restarted, now UP again INFO [GossipStage:1] 2011-12-05 14:19:36,971 Gossiper.java (line 683) InetAddress /130.199.185.195 is now UP INFO [GossipStage:1] 2011-12-05 14:19:36,974 StorageService.java (line 819) Node /130.199.185.195 state jump to normal INFO [main] 2011-12-05 14:19:37,003 CassandraDaemon.java (line 115) Binding thrift service to cassandra02.usatlas.bnl.gov/130.199.185.194:9160 INFO [main] 2011-12-05 14:19:37,016 CassandraDaemon.java (line 124) Using TFastFramedTransport with a max frame size of 15728640 bytes. INFO [main] 2011-12-05 14:19:37,018 CassandraDaemon.java (line 151) Using synchronous/threadpool thrift server on cassandra02.usatlas.bnl.gov/130.199.185.194:9160 INFO [Thread-6] 2011-12-05 14:19:37,019 CassandraDaemon.java (line 203) Listening for thrift clients... INFO [GossipTasks:1] 2011-12-05 14:19:50,601 Gossiper.java (line 697) InetAddress /130.199.185.195 is now dead. ERROR [HintedHandoff:1] 2011-12-05 14:20:37,954 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[HintedHandoff:1,1,main] java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /130.199.185.193 in 6ms at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Could not reach schema agreement with /130.199.185.193 in 6ms at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293) at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304) at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89) at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 3 more ERROR [HintedHandoff:1] 2011-12-05 14:20:37,956 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[HintedHandoff:1,1,main] java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /130.199.185.193 in 6ms at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.lang.RuntimeException: Could not reach schema agreement with /130.199.185.193 in 6ms at org.apache.cassandra.db.HintedHandOffManager.waitForSchemaAgreement(HintedHandOffManager.java:293) at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:304) at org.apache.cassandra.db.HintedHandOffManager.access$100(HintedHandOffManager.java:89) at org.apache.cassandra.db.HintedHandOffManager$2.runMayThrow(HintedHandOffManager.java:397) = Node #1 (nothing run) INFO [main] 2011-12-05 14:16:15,779 CassandraDaemon.java (line 115) Binding thrift service to cassandra01.usatlas.bnl.gov/130.199.185.193:9160 INFO [main] 2011-12-05 14:16:15,782 CassandraDaemon.java (line 124) Using TFastFramedTransport with a max frame size of 15728640 bytes. INFO [main] 2011-12-05 14:16:15,784 CassandraDaemon.java (line
Could not reach schema agreement... 0.8.6
Hello, upon startup, in my cluster of 3 machines, I see similar messages in system.log on each node (below). I start nodes one by one, after I ascertain the previous one is online. So they can't reach schema agreement, all of them. Why? No unusual load visible in Ganglia plots. ERROR [HintedHandoff:1] 2011-12-05 19:52:17,426 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Hint edHandoff:1,1,main] java.lang.RuntimeException: java.lang.RuntimeException: Could not reach schema agreement with /130.199.185.194 in 6ms at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662)
Re: Repair failure under 0.8.6
I capped the heap and the error is still there. So I keep seeing node dead messages even when I know the nodes were OK. Where and how do I tweak timeouts? 9d-cfc9-4cbc-9f1d-1467341388b8, endpoint /130.199.185.193 died INFO [GossipStage:1] 2011-12-04 00:26:16,362 Gossiper.java (line 683) InetAddress /130.199.185.193 is now UP ERROR [AntiEntropySessions:1] 2011-12-04 00:26:16,518 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime] java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Problem during repair session manual-repair-a6a655dc-63f0-4c1c-9c0b-0621f5692ba2, endpoint /130.199.185.194 died at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712) at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749) at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155) at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527) at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) On 12/3/2011 8:34 PM, Maxim Potekhin wrote: Thank you Peter. Before I look into the details as you suggest, may I ask what you mean by automatically restarted? The way the box and Cassandra are set up in my case is such that the death of either is final. Also, how do I look for full GC? I just realized that in the latest install I might have omitted capping the heap size -- and the nodes have 48GB each. I guess this could be a problem, precipitating GC death, right? Thank you Maxim On 12/3/2011 7:46 PM, Peter Schuller wrote: quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? I was about to respond to say that repair doesn't fail just due to failure detection, but this appears to have been broken by CASSANDRA-2433 :( Unless there is a subtle bug, the exception you're seeing should be indicative that it really was considered Down by the node. You might grep the log for references to the node in question (UP or DOWN) to confirm. The question is why, though. I would check if the node has maybe automatically restarted, or went into full GC, etc.
Re: Repair failure under 0.8.6
Thanks Peter! I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? I do recall that I see nodes temporarily marked as down, only to pop up later. In the current situation there is no load on the cluster at all, outside maintenance like the repair. How do I configure the print level for the GC report? Thank you, Maxim On 12/4/2011 2:09 PM, Peter Schuller wrote: I capped the heap and the error is still there. So I keep seeing node dead messages even when I know the nodes were OK. Where and how do I tweak timeouts? You can increase phi_convict_threshold in the configuration. However, I would rather want to find out why they are being marked as down to begin with. In a healthy situation, especially if you are not putting extreme load on the cluster, there is very little reason for hosts to be marked as down unless there's some bug somewhere. Is this cluster under constant traffic? Are you seeing slow requests from the point of view of the client (indicating that some requests are routed to nodes that are temporarily inaccessible)? With respect to GC, I would recommend running with -XX:+PrintGC, -XX:+PrintGCDetails, -XX:+PrintGCTimeStamps and -XX:+PrintGCDateStamps, and then look at the system log. A fallback to full GC should be findable by grepping for Full. Also, is this a problem with one specific host, or is it happening to all hosts every now and then? And I mean either the host being flagged as down, or the host that is flagging others as down. As for the uncapped heap: generally, a larger heap is not going to make it more likely to fall back to full GC; usually the opposite is true. However, a larger heap can make some of the non-full GC pauses longer, depending. In either case, running with the above GC options will give you specific information on GC pauses and should allow you to rule that out (or not).
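For reference, the setting lives in cassandra.yaml, and each edited node needs a restart for it to take effect; the default is 8, and the value below is only an example:

# cassandra.yaml -- raise the failure detector threshold so that
# brief silences are less likely to mark a node down
phi_convict_threshold: 12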
Re: Repair failure under 0.8.6
Please disregard the GC part of the question -- I found it.
Re: can not create a column family named 'index'
I seem to recall problems when using a cf called indexRegistry, don't remember much detail now. Maxim On 11/30/2011 7:24 PM, Shu Zhang wrote: Hi, just wondering if this is intentional: [default@test] create column family index; Syntax error at position 21: mismatched input 'index' expecting set null [default@test] create column family idx; b9aae960-1bb2-11e1--bf27a177f2f6 Waiting for schema agreement... ... schemas agree across the cluster Thanks, Shu
Re: Repair failure under 0.8.6
As a side effect of the failed repair (so it seems), the disk usage on the affected node prevents compaction from working. It still works on the remaining nodes (we have 3 total). Is there a way to scrub the extraneous data? Thanks Maxim On 12/4/2011 4:29 PM, Peter Schuller wrote: I will try to increase phi_convict -- I will just need to restart the cluster after the edit, right? You will need to restart the nodes for which you want the phi convict threshold to be different. You might want to do it on, e.g., half of the cluster for A/B testing. I do recall that I see nodes temporarily marked as down, only to pop up later. I recommend grepping through the logs on all the nodes (e.g., cat /var/log/cassandra/cassandra.log | grep UP | wc -l). That should tell you quickly whether they all seem to be seeing roughly as many node flaps, or whether some particular node or set of nodes is/are over-represented. Next, look at the actual nodes flapping (remove the wc -l) and see if all nodes are flapping, or if it is a single node, or a subset of the nodes (e.g., sharing a switch perhaps). In the current situation, there is no load on the cluster at all, outside the maintenance like the repair. OK. So what I'm getting at then is that there may be real, legitimate connectivity problems that you aren't noticing in any other way, since you don't have active traffic on the cluster.
Repair failure under 0.8.6
Please help -- I've been having pretty consistent failures that look like this one. I don't know how to proceed. The text below comes from the system log. The cluster was all up before and after the attempted repair, so I don't quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? Thanks, Maxim INFO [GossipStage:1] 2011-12-02 17:12:07,293 Gossiper.java (line 683) InetAddress /130.199.185.194 is now UP ERROR [AntiEntropySessions:1] 2011-12-02 17:12:07,354 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[AntiEntropySessions:1,5,RMI Runtime] java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-618fad49-387f-44df-a25e-aa57b314768a, endpoint /130.199.185.194 died at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.IOException: Problem during repair session manual-repair-618fad49-387f-44df-a25e-aa57b314768a, endpoint /130.199.185.194 died at org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(AntiEntropyService.java:712) at org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(AntiEntropyService.java:749) at org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:155) at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:527) at org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) at org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204) ... 3 more INFO [AntiEntropyStage:1] 2011-12-02 17:12:07,392 AntiEntropyService.java (line 215) Sending AEService tree for #TreeRequest manual-repair-c721c217-4b70-4a15-91fc-374b39b8b053, cassandra03.usatlas.bnl.gov/130.199.185.195, (PANDA,files), (56713727820156410577229101238628035242,113427455640312821154458202477256070484]
Re: Repair failure under 0.8.6
Thank you Peter. Before I look into the details as you suggest, may I ask what you mean by automatically restarted? The way the box and Cassandra are set up in my case is such that the death of either is final. Also, how do I look for full GC? I just realized that in the latest install I might have omitted capping the heap size -- and the nodes have 48GB each. I guess this could be a problem, precipitating GC death, right? Thank you Maxim On 12/3/2011 7:46 PM, Peter Schuller wrote: quite understand how Cassandra declared a node dead (in the below). Was it a timeout? How do I fix that? I was about to respond to say that repair doesn't fail just due to failure detection, but this appears to have been broken by CASSANDRA-2433 :( Unless there is a subtle bug, the exception you're seeing should be indicative that it really was considered Down by the node. You might grep the log for references to the node in question (UP or DOWN) to confirm. The question is why, though. I would check if the node has maybe automatically restarted, or went into full GC, etc.
How many indexes to keep? Guidelines
As a matter of practice, how many secondary indexes on a CF do you usually keep? What are rules of thumb? Is 10 too many? 100? 1000? Thanks Maxim
Re: Yanking a dead node
Thanks! Looks pretty obvious in retrospect... Regards, Maxim On 11/24/2011 6:54 AM, Filipe Gonçalves wrote: Just remove its token from the ring using nodetool removetoken <token>. 2011/11/23 Maxim Potekhin potek...@bnl.gov: This was discussed a long time ago, but I need to know what's the state of the art answer to that: assume one of my few nodes is very dead. I have no resources or time to fix it. Data is replicated so the data is still available in the cluster. How do I completely remove the dead node without having to rebuild it, repair, drain and decommission? TIA Maxim
Yanking a dead node
This was discussed a long time ago, but I need to know what's the state of the art answer to that: assume one of my few nodes is very dead. I have no resources or time to fix it. Data is replicated so the data is still available in the cluster. How do I completely remove the dead node without having to rebuild it, repair, drain and decommission? TIA Maxim
7199
Hello, I have this in my cassandra-env.sh JMX_PORT=7199 Does this mean that if I use nodetool from another node, it will try to connect to that particular port? Thanks, Maxim
Re: 7199
Thanks. I'm trying to look up the HttpAdaptor and what it does -- can you give any pointers? I didn't find much useful info just yet. Maxim On 11/22/2011 9:52 PM, Jeremiah Jordan wrote: Yes, that is the port nodetool needs to access. On Nov 22, 2011, at 8:43 PM, Maxim Potekhin wrote: Hello, I have this in my cassandra-env.sh JMX_PORT=7199 Does this mean that if I use nodetool from another node, it will try to connect to that particular port? Thanks, Maxim
Re: read performance problem
Try to see if there is a lot of paging going on, and run some benchmarks on the disk itself. Are you running Windows or Linux? Do you think the disk may be fragmented? Maxim On 11/19/2011 8:58 PM, Kent Tong wrote: Hi, On my computer with 2G RAM and a Core 2 Duo CPU E4600 @ 2.40GHz, I am testing the performance of Cassandra. The write performance is good: it can write a million records in 10 minutes. However, the query performance is poor and it takes 10 minutes to read 10K records with sequential keys from 0 to (about 100 QPS). This is far away from the 3,xxx QPS found on the net. Cassandra decided to use 1G as the Java heap size, which seems to be fine, as at the end of the benchmark the swap was barely used (only 1M used). I understand that my computer may not be as powerful as those used in the other benchmarks, but it shouldn't be that far off (1:30), right? Any suggestion? Thanks in advance!
A Cassandra CLI question: null vs 0 rows
Hello everyone, I run a query on a secondary index. For some queries, I get 0 rows returned. In other cases, I just get a string that reads null. What's going on? TIA Maxim
Re: A Cassandra CLI question: null vs 0 rows
Thanks Jonathan. I get the below error. I don't have a clue as to what it means. null java.lang.RuntimeException at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:310) at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217) at org.apache.cassandra.cli.CliMain.main(CliMain.java:345) Caused by: java.lang.RuntimeException at org.apache.cassandra.cli.CliClient.executeGetWithConditions(CliClient.java:814) at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:208) ... 2 more On 11/17/2011 12:28 PM, Jonathan Ellis wrote: If the CLI returns null it means there was an error -- run with --debug to check the exception. On Thu, Nov 17, 2011 at 11:20 AM, Maxim Potekhin potek...@bnl.gov wrote: Hello everyone, I run a query on a secondary index. For some queries, I get 0 rows returned. In other cases, I just get a string that reads null. What's going on? TIA Maxim
What sort of load do the tombstones create on the cluster?
In view of my unpleasant discovery last week that deletions in Cassandra lead to a very real and serious performance loss, I'm working on a strategy for moving forward. If the tombstones do cause such a problem, where should I be looking for performance bottlenecks? Is it disk, CPU or something else? Thing is, I don't see anything outstanding in my Ganglia plots. TIA, Maxim
Varying number of rows coming from same query on same database
Hello, I'm running the same query repeatedly. It's a secondary index query, done from a Pycassa client. I see that when I iterate the result object, I get a slightly different number of entries on each run of the test, executed serially. There are no deletions in the database, and no writes; it's static for now. Any comments will be appreciated. Maxim
Re: Data Model Design for Login Service
1122: { gender: MALE birthdate: 1987.11.09 name: Alfred Tester pwd: e72c504dc16c8fcd2fe8c74bb492affa alias1: alfred.tes...@xyz.de alias2: alf...@aad.de alias3: a...@dd.de } ...and you can use secondary indexes to query on anything. Maxim On 11/17/2011 4:08 PM, Maciej Miklas wrote: Hello all, I need your help to design the structure for a simple login service. It contains about 100.000.000 customers, and each one can have about 10 different logins -- this results in 1.000.000.000 different logins. Each customer contains the following data: - one to many login names as strings, max 20 UTF-8 characters long - ID as long - one customer has only one ID - gender - birth date - name - password as MD5 The login process needs to find the user by login name. Data in Cassandra is replicated -- this is necessary to obtain all required login data in a single call. We also expect low write traffic and heavy read traffic -- round trips for reading data should be avoided. Below I've described two possible Cassandra data models based on an example: we have two users, the first user has two logins and the second user has three logins. A) Skinny rows - the row key contains the login name -- this is the main search criterion - login data is replicated -- each possible login is stored as a single row which contains all user data -- 10 logins for a single customer create 10 rows, where each row has a different key and the same content // the first 3 rows have different keys and the same replicated data alfred.tes...@xyz.de { id: 1122 gender: MALE birthdate: 1987.11.09 name: Alfred Tester pwd: e72c504dc16c8fcd2fe8c74bb492affa }, alf...@aad.de { id: 1122 gender: MALE birthdate: 1987.11.09 name: Alfred Tester pwd: e72c504dc16c8fcd2fe8c74bb492affa }, a...@dd.de { id: 1122 gender: MALE birthdate: 1987.11.09 name: Alfred Tester pwd: e72c504dc16c8fcd2fe8c74bb492affa }, // the two following rows again have the same data, for the second customer manf...@xyz.de { id: 1133 gender: MALE birthdate: 1997.02.01 name: Manfredus Maximus pwd: e44c504ff16c8fcd2fe8c74bb492adda }, rober...@xyz.de { id: 1133 gender: MALE birthdate: 1997.02.01 name: Manfredus Maximus pwd: e44c504ff16c8fcd2fe8c74bb492adda } B) Rows grouped by alphabetical prefix - the number of rows is limited -- for example by the first letter of the login name - each row contains all logins which begin with the row key -- the row with key 'a' contains all logins which begin with 'a' - data might be unbalanced, but we avoid skinny rows -- this might have a positive performance impact (??) - to avoid super columns, each row contains plain columns, where the column name is the user login and the column value is the corresponding data in a kind of serialized form (I would like to have it human readable) a { alfred.tes...@xyz.de:1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa, alf...@aad.de:1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa, a...@dd.de:1122;MALE;1987.11.09;Alfred Tester;e72c504dc16c8fcd2fe8c74bb492affa }, m { manf...@xyz.de:1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda }, r { rober...@xyz.de:1133;MALE;1997.02.01;Manfredus Maximus;e44c504ff16c8fcd2fe8c74bb492adda } Which solution is better, especially for read performance? Do you have a better idea? Thanks, Maciej
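For the one-row-per-customer model, a minimal Pycassa sketch of the look-up by alias via a secondary index (keyspace, CF and alias value are hypothetical):

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('Auth', ['localhost:9160'])
users = pycassa.ColumnFamily(pool, 'users')

# Find the customer row whose indexed 'alias1' column matches the login.
expr = create_index_expression('alias1', 'alfred@example.com')
clause = create_index_clause([expr], count=1)
for key, columns in users.get_indexed_slices(clause):
    print('%s -> %s' % (key, columns))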
Re: A Cassandra CLI question: null vs 0 rows
Should I file a ticket? I consistently see this behavior after a mass delete. On 11/17/2011 12:46 PM, Maxim Potekhin wrote: Thanks Jonathan. I get the below error. Don't have a clue as to what it means.

null
java.lang.RuntimeException
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:310)
at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
Caused by: java.lang.RuntimeException
at org.apache.cassandra.cli.CliClient.executeGetWithConditions(CliClient.java:814)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:208)
... 2 more

On 11/17/2011 12:28 PM, Jonathan Ellis wrote: If the CLI returns null it means there was an error -- run with --debug to check the exception. On Thu, Nov 17, 2011 at 11:20 AM, Maxim Potekhin potek...@bnl.gov wrote: Hello everyone, I run a query on a secondary index. For some queries, I get 0 rows returned. In other cases, I just get a string that reads null. What's going on? TIA Maxim
Re: Mass deletion -- slowing down
Thanks for the note. Ideally I would not like to keep track of the oldest indexed date, because that means I'm already creating a bit of infrastructure on top of my database, with attendant referential integrity problems. But I suppose I'll be forced to do that. In addition, I'll have to wait until the grace period is over and compact, removing the tombstones and finally clearing the disk (which is what I need to do in the first place). Frankly, this whole situation for me illustrates a very real deficiency in Cassandra -- one would think that deleting less than one percent of the data shouldn't lead to complete failures of certain indexed queries. That's bad. Maxim On 11/14/2011 3:01 AM, Guy Incognito wrote: i think what he means is... do you know what the 'oldest' day is? eg if you have a rolling window of say 2 weeks, structure your query so that your slice range only goes back 2 weeks, rather than to the beginning of time. this would avoid iterating over all the tombstones from prior to the 2-week window. this wouldn't work if you are deleting arbitrary days in the middle of your date range. On 14/11/2011 02:02, Maxim Potekhin wrote: Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly. Also, I note that queries on different dates (i.e. not contaminated with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far. Theoretically -- would compaction or cleanup help? Thanks Maxim On 11/13/2011 8:39 PM, Peter Schuller wrote: I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine, Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?
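[Editor's note: to illustrate Guy's rolling-window suggestion -- this sketch assumes a hypothetical wide-row layout with day-prefixed column names, which is not Maxim's actual schema. Bounding the slice start to the retention window keeps reads from walking tombstones older than the window. pycassa 1.x assumed.]

from datetime import date, timedelta
import pycassa

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
events = pycassa.ColumnFamily(pool, 'events_by_day')  # hypothetical CF

# Columns are named 'YYYYMMDD:<job id>'. With a 2-week rolling window,
# start the slice at the oldest day that can still hold live data, so
# the read never iterates over tombstones from before the window.
oldest = (date.today() - timedelta(days=14)).strftime('%Y%m%d')
recent = events.get('bucket', column_start=oldest, column_count=1000)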
Re: Mass deletion -- slowing down
I've done more experimentation and the behavior persists: I start with a normal dataset which is searchable by a secondary index. I select by that index the entries that match a certain criterion, then delete those. I tried two methods of deletion -- individual cf.remove() calls as well as batch removal in Pycassa. What happens after that is as follows: attempts to read the same CF, using the same index values, start to time out in the Pycassa client (there is a thrift message about the timeout). The entries not touched by such attempted deletion are still read just fine. Has anyone seen such behavior? Thanks, Maxim On 11/10/2011 8:30 PM, Maxim Potekhin wrote: Hello, My data load comes in batches representing one day in the life of a large computing facility. I index the data by the day it was produced, to be able to quickly pull data for a specific day within the last year or two. There are 6 other indexes. When it comes to retiring the data, I intend to delete it for the oldest date and after that add a fresh batch of data, so I control the disk space. Therein lies a problem -- and it may be Pycassa-related, so I also filed an issue on github -- when I select by 'DATE=blah' and then do a batch remove, it works fine for a while, and then after a few thousand deletions (done in batches of 1000) it grinds to a halt, i.e. I can no longer iterate the result, which manifests in a timeout error. Is that a behavior seen before? Cassandra version is 0.8.6, Pycassa 1.3.0. TIA, Maxim
Re: Mass deletion -- slowing down
Thanks to all for the valuable insight! Two comments: a) this is not actually time series data, but yes, each item has a timestamp and thus chronological attribution. b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operating procedure? For some reason I can still select on the index in the CLI; it's the Pycassa module that gives me trouble, but I need it as this is my platform and we are a Python shop. Maxim On 11/13/2011 7:22 PM, Peter Schuller wrote: Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes) and under some circumstances reads can turn O(n) with respect to the number of columns deleted, depending on the read pattern. It sounds like this is what you're seeing. For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra, for the row you now have: tomb tomb tomb tomb actual data. If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones (for the PostgreSQL-aware: this is similar to the effect you can get when implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum -- improved in modern versions).
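[Editor's note: a minimal sketch of the pattern Peter describes, against a hypothetical 'queue' CF with pycassa 1.x. Repeated insert-then-delete leaves a long run of tombstones in front of the live columns, and a slice that starts at the beginning of the row has to walk all of them.]

import pycassa

pool = pycassa.ConnectionPool('demo', server_list=['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'queue')  # hypothetical CF

# Insert a range of columns, delete it, insert the next non-overlapping
# range, repeat. The row ends up as: tomb tomb ... tomb | live data.
for batch_no in range(100):
    cols = dict(('col%08d' % (batch_no * 1000 + i), 'x') for i in range(1000))
    cf.insert('row', cols)
    if batch_no < 99:                      # keep only the last batch alive
        cf.remove('row', columns=list(cols))

# This slice starts at the beginning of the row, so Cassandra must read
# through ~99,000 tombstones before finding the first live column:
slow = cf.get('row', column_count=100)

# Starting the slice at the live range skips the tombstones entirely:
fast = cf.get('row', column_start='col00099000', column_count=100)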
Re: Mass deletion -- slowing down
Brandon, thanks for the note. Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into buckets, each representing one day in the system's activity. I create the DATE attribute and add it to each row, e.g. as a column {'DATE': '2013'}. I create an index on that column, along with a few others. Now, I want to rotate the data out of my database on a daily basis. For that, I need to select on 'DATE' and then do a delete. I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine; it's just the indexed queries that start to time out. I changed the timeouts and the number of retries in the Pycassa pool, but that doesn't seem to help. Thanks, Maxim On 11/13/2011 8:00 PM, Brandon Williams wrote: On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin potek...@bnl.gov wrote: Thanks to all for the valuable insight! Two comments: a) this is not actually time series data, but yes, each item has a timestamp and thus chronological attribution. b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operation procedure? I'd have to know more about what your access pattern is like to give you a fully informed answer. For some reason I can still select on the index in the CLI, it's the Pycassa module that gives me trouble, but I need it as this is my platform and we are a Python shop. This seems odd, since the rpc_timeout is the same for all clients. Maybe pycassa is asking for more data than the cli? -Brandon
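[Editor's note: the write-and-rotate pattern Maxim describes, sketched with pycassa 1.x. The keyspace, CF, and column names are stand-ins for his actual schema, and the secondary index on DATE is assumed to be declared in the schema.]

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
jobs = pycassa.ColumnFamily(pool, 'jobs')  # hypothetical CF

# Each job row carries a DATE bucket column; the index on DATE makes
# the day's rows selectable server-side.
jobs.insert('job-000001', {'status': 'done', 'DATE': '20111113'})

# Rotating a day out: select on the index, remove the matching rows.
clause = create_index_clause(
    [create_index_expression('DATE', '20111113')], count=1000)
with jobs.batch(queue_size=1000) as b:
    for key, _ in jobs.get_indexed_slices(clause, columns=['DATE']):
        b.remove(key)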
Re: Mass deletion -- slowing down
Brandon, it won't work in my application, as I need a few indexes on attributes of the job. In addition, a large portion of the queries is based on key-value lookup, and that key is the unique job ID. I really can't have the data packed in one row per day. Thanks, Maxim On 11/13/2011 8:34 PM, Brandon Williams wrote: On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin potek...@bnl.gov wrote: Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into buckets, each representing one day in the system's activity. I create the DATE attribute and add it to each row, e.g. as a column {'DATE': '2013'}. Hmm, so why is pushing this into the row key and then deleting the entire row not acceptable? (this is what the link I gave would prescribe) In other words, you bucket at the row level, instead of relying on a column attribute that needs an index. -Brandon
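[Editor's note: Brandon's alternative, sketched with pycassa 1.x against a hypothetical CF where the row key is the day and the column names are job IDs.]

import pycassa

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
days = pycassa.ColumnFamily(pool, 'jobs_by_day')  # hypothetical CF

# Bucket at the row level: all of a day's job IDs live in one wide row.
days.insert('20111113', {'job-000001': '', 'job-000002': ''})

# Retiring a day is then a single row deletion -- one row-level
# tombstone, no secondary index involved.
days.remove('20111113')

The trade-off is exactly the one Maxim raises: per-job key lookups and additional per-attribute indexes no longer come for free with this layout.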
Re: Mass deletion -- slowing down
Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and timestamps do not correlate, in the sense that chronologically newer entries might have a smaller sequential ID. That's because the timestamp corresponds to the last update, which is stochastic in the sense that jobs can take from seconds to days to complete. As I said, I'm not sure I understood you correctly. Also, I note that queries on different dates (i.e. not contaminated with lots of tombstones) work just fine, which is consistent with the picture that has emerged so far. Theoretically -- would compaction or cleanup help? Thanks Maxim On 11/13/2011 8:39 PM, Peter Schuller wrote: I do limit the number of rows I'm asking for in Pycassa. Queries on primary keys still work fine, Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away, it could keep a record of what the oldest might be) so that you can bound your index lookup >= that value (and avoid the tombstones)?
Mass deletion -- slowing down
Hello, My data load comes in batches representing one day in the life of a large computing facility. I index the data by the day it was produced, to be able to quickly pull data for a specific day within the last year or two. There are 6 other indexes. When it comes to retiring the data, I intend to delete it for the oldest date and after that add a fresh batch of data, so I control the disk space. Therein lies a problem -- and it may be Pycassa-related, so I also filed an issue on github -- when I select by 'DATE=blah' and then do a batch remove, it works fine for a while, and then after a few thousand deletions (done in batches of 1000) it grinds to a halt, i.e. I can no longer iterate the result, which manifests in a timeout error. Is that a behavior seen before? Cassandra version is 0.8.6, Pycassa 1.3.0. TIA, Maxim
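[Editor's note: for reference, the select-then-batch-remove loop described above might look like the following in pycassa 1.x (CF and column names hypothetical). It pages through the index with start_key so each round fetches at most 1000 rows.]

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
jobs = pycassa.ColumnFamily(pool, 'jobs')  # hypothetical CF

expr = create_index_expression('DATE', '20111110')
start_key = ''
while True:
    clause = create_index_clause([expr], start_key=start_key, count=1000)
    rows = list(jobs.get_indexed_slices(clause, columns=['DATE']))
    if not rows:
        break
    with jobs.batch(queue_size=1000) as b:
        for key, _ in rows:
            b.remove(key)
    # start_key is inclusive; the last key may be seen once more on the
    # next round, but removing it a second time is harmless.
    start_key = rows[-1][0]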
Is there a way to get only keys with get_indexed_slices?
Is there a way to get only keys with get_indexed_slices? Looking at the code, it's not possible, but -- is there some way anyhow? I don't want to extract any data, just a list of matching keys. TIA, Maxim
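[Editor's note: the thrift API behind get_indexed_slices always returns a slice of columns per key, so there appears to be no true keys-only mode; the closest approximation is to restrict the slice to a single small column -- e.g. the indexed column itself -- and discard the values. A sketch with pycassa 1.x, CF and column names hypothetical.]

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
jobs = pycassa.ColumnFamily(pool, 'jobs')  # hypothetical CF

clause = create_index_clause(
    [create_index_expression('DATE', '20111113')], count=1000)
# Fetch only the indexed column, keeping the per-key payload minimal.
keys = [key for key, _ in jobs.get_indexed_slices(clause, columns=['DATE'])]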
Error connection to remote JMX agent during repair
Hello, I'm trying to run repair on one of my nodes, which needs to be repopulated after a hard drive failure. What I'm getting is below. Note: I'm not loading JMX with Cassandra, and it always worked before... The version is 0.8.6. Any help will be appreciated, Maxim

Error connection to remote JMX agent!
java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.net.SocketTimeoutException: Read timed out]
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:338)
at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:140)
at org.apache.cassandra.tools.NodeProbe.init(NodeProbe.java:110)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:582)
Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.net.SocketTimeoutException: Read timed out]
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
at javax.naming.InitialContext.lookup(InitialContext.java:392)
at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1886)
at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1856)
at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
... 4 more
Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is: java.net.SocketTimeoutException: Read timed out
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)
at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
... 9 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readByte(DataInputStream.java:248)
at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)
Re: Tool for SQL - Cassandra data movement
Just a short comment -- we are going the CSV way as well because of its compactness and extreme portability. The CSV files are kept in the cloud as backup; they can also find other uses. JSON would work as well, but it would be at least twice as large. Maxim On 9/22/2011 1:25 PM, Nehal Mehta wrote: We are trying to carry out the same stuff, but instead of migrating into JSON, we are exporting into CSV and then importing the CSV into Cassandra. Which DB are you currently using? Thanks, Nehal Mehta. 2011/9/22 Radim Kolar h...@sendmail.cz I need a tool which is able to dump tables via JDBC into JSON format for Cassandra import. I am pretty sure that somebody already wrote that. Are there tools which can do a direct JDBC -> Cassandra import?
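[Editor's note: a minimal sketch of the CSV-to-Cassandra leg with Python's stdlib csv module and pycassa 1.x. The file name, key column, keyspace, and CF are assumptions; the CSV export itself is left to the source database's own tooling.]

import csv
import pycassa

pool = pycassa.ConnectionPool('warehouse', server_list=['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'imported')  # hypothetical CF

with open('export.csv', 'rb') as f:
    reader = csv.DictReader(f)           # the header row supplies column names
    with cf.batch(queue_size=500) as b:  # buffer mutations into batches
        for row in reader:
            key = row.pop('id')          # assume the 'id' column is the row key
            b.insert(key, row)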
Re: CMS GC initial-mark taking 6 seconds , bad?
Hello Aaron, I happen to have 48GB on each machine I use in the cluster. Can I assume that I can't really use all of this memory productively? Do you have any suggestion related to that? Can I run more than one instance of Cassandra on the same box (using different ports) to take advantage of this memory, assuming the disk has enough bandwidth? Thanks, Maxim On 9/25/2011 11:37 AM, aaron morton wrote: It does seem long and will be felt by your application. Are you running a 47GB heap? Most peeps seem to think 8 to 12 is about the viable maximum. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 25/09/2011, at 7:14 PM, Yang wrote: I see the following in my GC log: 1910.513: [GC [1 CMS-initial-mark: 2598619K(26214400K)] 13749939K(49807360K), 6.0696680 secs] [Times: user=6.10 sys=0.00, real=6.07 secs] so there is a stop-the-world period of 6 seconds. Does this sound bad? Or is 6 seconds OK, and we should expect the built-in fault tolerance of Cassandra to handle this? Thanks Yang
Re: hw requirements
Sorry about the unclear naming scheme. I meant that if I want to index on a few columns simultaneously, I create a new column with the catenated values of those columns. On 8/31/2011 3:10 PM, Anthony Ikeda wrote: Sorry to fork this topic, but for composite indexes do you mean as strings or as Composite()? I only ask because we have started using Composite as row keys and column names to replace the use of concatenated strings, mainly for lookup purposes. Anthony On Wed, Aug 31, 2011 at 10:27 AM, Maxim Potekhin potek...@bnl.gov wrote: Plenty of comments in this thread already, and I agree with those saying it depends. From my experience, a cluster with 18 spindles total could not match the performance and throughput of our primary Oracle server, which had 108 spindles. After we upgraded to SSDs, things definitely changed for the better for Cassandra. Another thing is that if you plan to implement composite indexes by catenating column values into additional columns, that constitutes a write, hence you'll need CPU. So watch out. On 8/29/2011 9:15 AM, Helder Oliveira wrote: Hello guys, What is the typical profile of a Cassandra server? Are SSDs an option? Does Cassandra need better CPUs or lots of memory? Are SATA II disks OK? I am making some tests, and I started evaluating the possible hardware. If someone already has conclusions about it, please share :D Thanks a lot.
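[Editor's note: a sketch of the catenated-column technique Maxim describes, with pycassa 1.x. The names are illustrative, and the secondary index on the combined column is assumed to be declared in the schema.]

import pycassa
from pycassa.index import create_index_clause, create_index_expression

pool = pycassa.ConnectionPool('grid', server_list=['localhost:9160'])
jobs = pycassa.ColumnFamily(pool, 'jobs')  # hypothetical CF

# At write time, catenate the values to be queried together into one
# extra column, and keep the secondary index on that column.
status, site = 'done', 'SITE_A'
jobs.insert('job-000001', {'status': status, 'site': site,
                           'status_site': status + ':' + site})

# A two-condition query then becomes a single indexed EQ lookup,
# with no server-side scan over the non-indexed condition.
clause = create_index_clause(
    [create_index_expression('status_site', 'done:SITE_A')], count=100)
results = list(jobs.get_indexed_slices(clause))

As noted above, the extra column is written on every insert, so the convenience at read time is paid for with CPU and write volume.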
Re: hw requirements
Plenty of comments in this thread already, and I agree with those saying it depends. From my experience, a cluster with 18 spindles total could not match the performance and throughput of our primary Oracle server, which had 108 spindles. After we upgraded to SSDs, things definitely changed for the better for Cassandra. Another thing is that if you plan to implement composite indexes by catenating column values into additional columns, that constitutes a write, hence you'll need CPU. So watch out. On 8/29/2011 9:15 AM, Helder Oliveira wrote: Hello guys, What is the typical profile of a Cassandra server? Are SSDs an option? Does Cassandra need better CPUs or lots of memory? Are SATA II disks OK? I am making some tests, and I started evaluating the possible hardware. If someone already has conclusions about it, please share :D Thanks a lot.
Re: Repair taking a long, long time
I can re-load all the data I have in the cluster, from a flat-file cache I keep on NFS, many times faster than nodetool repair takes. And that's not even accurate, because as others noted, nodetool repair eats up disk space for breakfast and takes more than 24hrs on a 200GB data load, at which point I have to cancel it. That's not acceptable. I simply don't know what to do now. On 7/20/2011 8:47 AM, David Boxenhorn wrote: I have this problem too, and I don't understand why. I can repair my nodes very quickly by looping through all my data (when you read your data it does read-repair), but nodetool repair takes forever. I understand that nodetool repair builds merkle trees, etc. etc., so it's a different algorithm, but why can't nodetool repair be smart enough to choose the best algorithm? Also, I don't understand what's special about my data that makes nodetool repair so much slower than looping through all my data. On Wed, Jul 20, 2011 at 12:18 AM, Maxim Potekhin potek...@bnl.gov wrote: Thanks Edward. I'm told by our IT that the switch connecting the nodes is pretty fast. Seriously, in my house I copy complete DVD images from my bedroom to the living room downstairs via WiFi, and a dozen GB does not seem like a problem, on dirt-cheap hardware (Patriot Box Office). I also have just _one_ major column family, but caveat emptor -- 8 indexes are attached to it (and there will be more). There is one accounting CF which is small and can't possibly make a difference. By contrast, compaction (as in nodetool) performs quite well on this cluster. I'm starting to suspect some sort of malfunction. Looking at the system log during the repair, there is some compaction agent doing work that I'm not sure makes sense (and I didn't call for it). Disk utilization all of a sudden goes up to 40% per Ganglia and stays there, which is pretty silly considering the cluster is IDLE and we have SSDs. No external writes, no reads. There are occasional GC stoppages, but these I can live with. This repair debacle has happened twice in a row now. Cr@p. I need to go to production soon, and that doesn't look good at all. If I can't manage a system this simple (and/or get help on this list) I may have to cut my losses, i.e. stay with Oracle. Regards, Maxim On 7/19/2011 12:16 PM, Edward Capriolo wrote: Well, most SSDs are pretty fast. There is one more thing to consider. If Cassandra determines that nodes are out of sync, it has to transfer data across the network. If that is the case, you have to look at 'nodetool streams' and determine how much data is being transferred between nodes. There are some open tickets where, with larger tables, repair streams more than it needs to. But even if the transfers are only 10% of your 200GB, transferring 20GB is not trivial. If you have multiple keyspaces and column families, repairing one at a time might make the process more manageable.
Repair taking a long, long time
We have something of the order of 200GB load on each of 3 machines in a balanced cluster under 0.8.1. I started repair about 24hrs ago and have done a moderate amount of inserts since then (a small fraction of the data load). The repair still appears to be running. What could go wrong? Thanks, Maxim