nodetool repair with vnodes
Greetings. I'm trying to run nodetool repair on a Cassandra 1.2.1 cluster of 3 nodes with 256 vnodes each. On a pre-1.2 cluster I used to launch a nodetool repair on every node every 24hrs. Now I'm getting a different behavior, and I'm sure I'm missing something. What I see on the command line is:

[2013-02-17 10:20:15,186] Starting repair command #1, repairing 768 ranges for keyspace goh_master
[2013-02-17 10:48:13,401] Repair session 3d140e10-78e3-11e2-af53-d344dbdd69f5 for range (6556914650761469337,6580337080281832001] finished
(…the last line repeats 767 times)

…so it seems to me that it is running on all vnode ranges. Also, whichever node I launch the command on, only one node's log is moving, and it is always the same node. So, to me, it looks like nodetool repair is always running on the same single node and repairing everything.

I'm sure I'm making some mistake, and I just can't find any clue in the documentation about what's wrong with my nodetool usage (if anything is wrong, btw). Is there anything I'm missing?

-- Marco Matarazzo
Re: Size Tiered -> Leveled Compaction
Hello Wei, First, thanks for this response. Out of curiosity, what SSTable size did you choose for your use case, and what made you decide on that number? Thanks, -Mike

On 2/14/2013 3:51 PM, Wei Zhu wrote:

I haven't tried to switch compaction strategy. We started with LCS.

For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads can hit more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded, and it's not recommended to turn on multithreaded compaction. We even tried that; it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2

Since our cluster is not write intensive, only 100 w/second, I don't see any pending compactions during regular operation.

One thing worth mentioning is the size of the SSTable. The default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That's more than 150K files in one directory (200G/5M = 40K SSTables, and each SSTable creates 4 files on disk). You might want to watch that and decide the SSTable size.

By the way, there is no concept of major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory; it tells you the SSTable distribution among the different levels.

-Wei

*From:* Charles Brophy cbro...@zulily.com *To:* user@cassandra.apache.org *Sent:* Thursday, February 14, 2013 8:29 AM *Subject:* Re: Size Tiered -> Leveled Compaction

I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well.
If anybody here has the wisdom to answer them it would be of wonderful help. Thanks, Charles

On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote:

Hello, I'm investigating the transition of some of our column families from Size Tiered to Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously.

I've stood up a test DB node to investigate the transition. I successfully altered the column family, and I immediately noticed a large number (1000+) of pending compaction tasks appear, but no compaction gets executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks don't move. I also noticed no changes to the size and distribution of the existing SSTables. I then ran a major compaction on the column family. All pending compaction tasks got run, and the SSTables now have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files).

Couple of questions:
1) Is a major compaction required to transition from size-tiered to leveled compaction?
2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered?

All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the ALTER TABLE CQL command, but I haven't found much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9.

Thanks, -Mike
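Wei's file-count arithmetic is worth sanity-checking before picking an SSTable size. A rough sketch (Python, purely illustrative; the 4-files-per-SSTable figure follows Wei's numbers for the pre-1.2 on-disk component layout):

```python
# Estimate how many files Leveled Compaction leaves in a data directory,
# following Wei's arithmetic: 200G / 5M = ~40K SSTables, ~4 files each.

def lcs_file_count(dataset_gb, sstable_size_mb, files_per_sstable=4):
    """Approximate on-disk file count for an LCS column family."""
    sstables = (dataset_gb * 1024) // sstable_size_mb
    return sstables * files_per_sstable

# 200 GB at the 5 MB default: ~160K files in one directory.
print(lcs_file_count(200, 5))    # 163840
# A larger sstable_size_in_mb keeps the directory manageable.
print(lcs_file_count(200, 128))  # 6400
```

The exact files-per-SSTable count varies by version, but the scaling is what matters: file count is inversely proportional to the configured SSTable size.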
Re: virtual nodes + map reduce = too many mappers
Thanks Eric for the appreciation :)

The default split size is 64K rows. ColumnFamilyInputFormat first collects all tokens and creates a split for each. If you have 256 vnodes per node, it creates 256 splits per node even if you have no data at all. The current split size only comes into play if you have a vnode that holds more than 64K rows.

A possible solution that came to my mind: we can simply extend ColumnFamilySplit by adding a list of token ranges instead of one. Then there is no need to create a mapper for each token; each mapper can do multiple range queries. But I don't know how to combine the range queries, because in a typical range query you need to set a start and end token, and with virtual nodes I realized that the tokens are not contiguous.

Best Regards, Cem

On Sun, Feb 17, 2013 at 2:47 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

Split size does not have to equal block size. http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html

"An abstract InputFormat that returns CombineFileSplit's in InputFormat.getSplits(JobConf, int) method. Splits are constructed from the files under the input paths. A split cannot have files from different pools. Each split returned may contain blocks from different files. If a maxSplitSize is specified, then blocks on the same node are combined to form a single split. Blocks that are left over are then combined with other blocks in the same rack. If maxSplitSize is not specified, then blocks from the same rack are combined in a single split; no attempt is made to create node-local splits. If the maxSplitSize is equal to the block size, then this class is similar to the default splitting behaviour in Hadoop: each block is a locally processed split. Subclasses implement InputFormat.getRecordReader(InputSplit, JobConf, Reporter) to construct RecordReader's for CombineFileSplit's."

Hive offers a CombinedHiveInputFormat https://issues.apache.org/jira/browse/HIVE-74

Essentially combined input formats rock hard.
If you have a directory with, say, 2000 files, you do not want 2000 splits and the overhead of starting and stopping 2000 mappers. If you enable CombineInputFormat you can tune mapred.split.size, and the number of mappers is based (mostly) on the input size. This gives jobs that would otherwise create too many map tasks way more throughput, and stops them from monopolizing the map slots on the cluster. It would seem like all the extra splits from the vnode change could be combined back together.

On Sat, Feb 16, 2013 at 8:21 PM, Jonathan Ellis jbel...@gmail.com wrote:

Wouldn't you have more than 256 splits anyway, given a normal amount of data? (Default split size is 64k rows.)

On Fri, Feb 15, 2013 at 7:01 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Seems like the Hadoop input format should combine the splits that are on the same node into the same map task, like Hadoop's CombinedInputFormat can. I am not sure who recommends vnodes as the default, because this is now the second problem (that I know of) of this class where vnodes have extra overhead: https://issues.apache.org/jira/browse/CASSANDRA-5161 This seems to be the standard operating practice in C* now: enable things in the default configuration, like new partitioners and newer features like vnodes, even though they are not heavily tested in the wild or well understood, then deal with the fallout.

On Fri, Feb 15, 2013 at 11:52 AM, cem cayiro...@gmail.com wrote:

Hi All, I have just started to use virtual nodes. I set the number of vnodes to 256 as recommended. The problem I have is that when I run a mapreduce job it creates nodes * 256 mappers, because it creates nodes * 256 splits. This affects performance since the range queries have a lot of overhead. Any suggestion to improve the performance? It seems like I need to lower the number of virtual nodes.

Best Regards, Cem

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced
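The split math in this thread can be made concrete with a toy model (Python; illustrative only, not the actual ColumnFamilyInputFormat code) of why vnodes inflate the mapper count and how combining ranges per split, as Cem proposes, would collapse it:

```python
# ColumnFamilyInputFormat creates one split per token range, so split count
# scales with nodes * num_tokens regardless of how much data exists.
# Packing several ranges into each split (analogous to what Hadoop's
# CombineFileInputFormat does for blocks) collapses the mapper count.

def naive_split_count(nodes, vnodes_per_node):
    return nodes * vnodes_per_node

def combined_split_count(nodes, vnodes_per_node, ranges_per_split):
    total_ranges = nodes * vnodes_per_node
    return -(-total_ranges // ranges_per_split)  # ceiling division

print(naive_split_count(6, 256))         # 1536 mappers for a 6-node cluster
print(combined_split_count(6, 256, 64))  # 24 mappers with 64 ranges/split
```

The `ranges_per_split` knob here is hypothetical; the point is that mapper count becomes a function of data volume and tuning, not of `num_tokens`.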
Re: NPE in running ClientOnlyExample
This is a bad example to follow. This is the internal client the Cassandra nodes use to talk to each other (the "fat client"); usually you do not use this unless you want to write some embedded code on the Cassandra server. Typically clients use the thrift/native transport. But you are likely getting the error you are seeing because the keyspace or column family is not created yet.

On Sat, Feb 16, 2013 at 11:41 PM, Jain Rahul ja...@ivycomptech.com wrote:

Hi All, I am a newbie to Cassandra and am trying to run an example program "ClientOnlyExample" taken from https://raw.github.com/apache/cassandra/cassandra-1.2/examples/client_only/src/ClientOnlyExample.java. While executing the program it gives me a null pointer exception. Can you guys please help me figure out what I am missing? I am using Cassandra version 1.2.1. I have pasted the logs at http://pastebin.com/pmADWCYe

Exception in thread "main" java.lang.NullPointerException
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:71)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:66)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:61)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:56)
  at org.apache.cassandra.db.RowMutation.add(RowMutation.java:183)
  at org.apache.cassandra.db.RowMutation.add(RowMutation.java:204)
  at ClientOnlyExample.testWriting(ClientOnlyExample.java:78)
  at ClientOnlyExample.main(ClientOnlyExample.java:135)

Regards, Rahul

This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised.
Any actions taken on the basis of this email are at the recipient's own risk.
Re: Deleting old items during compaction (WAS: Deleting old items)
That's what the TTL does. Manually delete all the older data now, then start using TTL.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 13/02/2013, at 11:08 PM, Ilya Grebnov i...@metricshub.com wrote:

Hi, we are looking for a solution to the same problem. We have a wide column family with counters and we want to delete old data, like 1 month old. One potential idea was to implement a hook in the compaction code and drop the columns we don't need. Is this a viable option? Thanks, Ilya

From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Tuesday, February 12, 2013 9:01 AM To: user@cassandra.apache.org Subject: Re: Deleting old items

"So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ?" No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi, I would like to know if there is a way to delete old/unused data easily. I know about TTL, but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete, but it seems inappropriate for such a massive operation. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understand that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates, or data older than 1 month? Alain
Re: Mutation dropped
You are hitting the maximum throughput on the cluster. The messages are dropped because the node fails to start processing them before rpc_timeout. However the request is still a success because the client-requested CL was achieved.

Testing with RF 2 and CL 1 really just tests the disks on one local machine. Both nodes replicate each row, and writes are sent to each replica, so the only thing the client is waiting on is the local node to write to its commit log.

Testing with (and running in prod) RF 3 and CL QUORUM is a more real world scenario.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote:

Hi - Is there a parameter which can be tuned to prevent the mutations from being dropped? Is this logic correct? Node A and B with RF=2, CL=1. Load balanced between the two.

-- Address Load Tokens Owns (effective) Host ID Rack
UN 10.x.x.x 746.78 GB 256 100.0% dbc9e539-f735-4b0b-8067-b97a85522a1a rack1
UN 10.x.x.x 880.77 GB 256 100.0% 95d59054-be99-455f-90d1-f43981d3d778 rack1

Once we hit a very high TPS (around 50k/sec of inserts), the nodes start falling behind and we see the mutation dropped messages. But there are no failures on the client. Does that mean the other node is not able to persist the replicated data? Is there some timeout associated with replicated data persistence?

Thanks, Kanwar

From: Kanwar Sangha [mailto:kan...@mavenir.com] Sent: 14 February 2013 09:08 To: user@cassandra.apache.org Subject: Mutation dropped

Hi - I am doing a load test using YCSB across 2 nodes in a cluster and seeing a lot of mutation dropped messages. I understand that this is due to the replica not being written to the other node? RF = 2, CL = 1. From the wiki: "For MUTATION messages this means that the mutation was not applied to all replicas it was sent to. The inconsistency will be repaired by Read Repair or Anti Entropy Repair."

Thanks, Kanwar
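Aaron's point about CL vs RF can be sketched with a toy model (Python; this mirrors the coordinator's accounting in spirit, not actual Cassandra code):

```python
# A write "succeeds" once CL replicas ack, even if the remaining replicas
# time out past rpc_timeout and drop the mutation (to be reconciled later
# by read repair or anti-entropy repair, as the wiki quote says).

def write_outcome(rf, cl, acks_received):
    client_sees_success = acks_received >= cl
    replicas_missing_write = rf - acks_received
    return client_sees_success, replicas_missing_write

# Kanwar's setup: RF=2, CL=ONE. One replica acks in time, one drops.
print(write_outcome(rf=2, cl=1, acks_received=1))  # (True, 1)
# RF=3, CL=QUORUM: success needs 2 acks, so one overloaded replica is tolerated.
print(write_outcome(rf=3, cl=2, acks_received=2))  # (True, 1)
```

This is why the client sees no failures while mutation-dropped counters climb: the two observations measure different things.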
Re: [nodetool] repair with vNodes
I'm a bit late, but for reference: repair runs in two stages. First, differences are detected; you can monitor the validation compaction with nodetool compactionstats. Then the differences are streamed between the nodes; you can monitor that with nodetool netstats.

"Nodetool repair command has been running for almost 24hours and I can't see any activity from the logs or JMX." Grep for "session completed".

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 15/02/2013, at 11:38 PM, Haithem Jarraya haithem.jarr...@struq.com wrote:

Hi, I am new to Cassandra and I would like to hear your thoughts on this. We are running our tests with Cassandra 1.2.1, on a relatively small dataset (~60GB). The nodetool repair command has been running for almost 24 hours and I can't see any activity from the logs or JMX. What am I missing? Or is there a problem with nodetool repair? What other commands can I run to do a sanity check on the cluster? Can I run nodetool repair on different nodes at the same time?

Here is the current test deployment of Cassandra:

$ nodetool status
Datacenter: ams01 (Replication Factor 2)
=
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.70.48.23 38.38 GB 256 19.0% 7c5fdfad-63c6-4f37-bb9f-a66271aa3423 RAC1
UN 10.70.6.78 58.13 GB 256 18.3% 94e7f48f-d902-4d4a-9b87-81ccd6aa9e65 RAC1
UN 10.70.47.126 53.89 GB 256 19.4% f36f1f8c-1956-4850-8040-b58273277d83 RAC1
Datacenter: wdc01 (Replication Factor 1)
=
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.24.116.66 65.81 GB 256 22.1% f9dba004-8c3d-4670-94a0-d301a9b775a8 RAC1
Datacenter: sjc01 (Replication Factor 1)
=
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.55.104.90 63.31 GB 256 21.2% 4746f1bd-85e1-4071-ae5e-9c5baac79469 RAC1

Many Thanks, Haithem
Re: Question on Cassandra Snapshot
"With incremental_backup turned OFF in cassandra.yaml - are all SSTables under /data/TestKeySpace/ColumnFamily at all times?" No. They are deleted when they are compacted and no internal operations are referencing them.

"With incremental_backup turned ON in cassandra.yaml - are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups?" Yes, sort of. *All* SSTables ever created are in the backups directory, not just the ones currently live.

"Let's say I have taken a snapshot and moved /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape; at what point should I be backing up *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach?" Back up all the files in the snapshot. There may be files with non-.db extensions if you use levelled compaction. When you are finished with the snapshot, delete it. If the inode is no longer referenced from the live data dir it will be deleted.

"I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ - what are these timestamp directories?" Probably automatic snapshots from dropping keyspaces or CFs.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 16/02/2013, at 4:41 AM, S C as...@outlook.com wrote:

I appreciate any advice or pointers on this. Thanks in advance.

From: as...@outlook.com To: user@cassandra.apache.org Subject: Question on Cassandra Snapshot Date: Thu, 14 Feb 2013 20:47:14 -0600

I have been looking at incremental backups and snapshots. I have done some experimentation but could not come to a conclusion. Can somebody please help me understand it right? /data is my data partition.

With incremental_backup turned OFF in cassandra.yaml - are all SSTables under /data/TestKeySpace/ColumnFamily at all times?
With incremental_backup turned ON in cassandra.yaml - are current SSTables under /data/TestKeySpace/ColumnFamily/ with a hardlink to /data/TestKeySpace/ColumnFamily/backups?

Let's say I have taken a snapshot and moved /data/TestKeySpace/ColumnFamily/snapshots/snapshot-name/*.db to tape; at what point should I be backing up the *.db files from the /data/TestKeySpace/ColumnFamily/backups directory? Also, should I be deleting the *.db files whose inode matches the files in the snapshot? Is that a correct approach?

I noticed /data/TestKeySpace/ColumnFamily/snapshots/timestamp-ColumnFamily/ - what are these timestamp directories?

Thanks in advance. SC
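Since snapshots and incremental backups are hard links into the live data directory, the "same inode" check S C asks about is easy to script. A sketch (Python; the file names are made up for the demo):

```python
import os

def same_sstable(path_a, path_b):
    """Hard links to the same SSTable share an inode (and device)."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    return (sa.st_ino, sa.st_dev) == (sb.st_ino, sb.st_dev)

# Demo: a "live" file and a hard link standing in for a snapshot copy.
with open("live-Data.db", "w") as f:
    f.write("sstable bytes")
os.link("live-Data.db", "snapshot-Data.db")
print(same_sstable("live-Data.db", "snapshot-Data.db"))  # True
os.remove("live-Data.db")
os.remove("snapshot-Data.db")
```

As Aaron notes, you don't normally need to do this yourself: deleting the snapshot releases its links, and the data only disappears from disk once no directory references the inode.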
unsubscribe
unsubscribe me please. Thank you
RE: NPE in running ClientOnlyExample
Thanks Edward, my bad. I was confused, as the example does seem to create the keyspace too, as far as I understand (although I'm not sure):

List<CfDef> cfDefList = new ArrayList<CfDef>();
CfDef columnFamily = new CfDef(KEYSPACE, COLUMN_FAMILY);
cfDefList.add(columnFamily);
try {
    client.system_add_keyspace(new KsDef(KEYSPACE, "org.apache.cassandra.locator.SimpleStrategy", 1, cfDefList));
    int magnitude = client.describe_ring(KEYSPACE).size();

Can I ask you to please point me to some examples I can start with? I tried to find some examples from Hector, but they seem to be in line with Cassandra's 1.1 version.

Regards, Rahul

-----Original Message----- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: 17 February 2013 21:49 To: user@cassandra.apache.org Subject: Re: NPE in running ClientOnlyExample

This is a bad example to follow. This is the internal client the Cassandra nodes use to talk to each other (the "fat client"); usually you do not use this unless you want to write some embedded code on the Cassandra server. Typically clients use the thrift/native transport. But you are likely getting the error you are seeing because the keyspace or column family is not created yet.

On Sat, Feb 16, 2013 at 11:41 PM, Jain Rahul ja...@ivycomptech.com wrote:

Hi All, I am a newbie to Cassandra and am trying to run an example program "ClientOnlyExample" taken from https://raw.github.com/apache/cassandra/cassandra-1.2/examples/client_only/src/ClientOnlyExample.java. While executing the program it gives me a null pointer exception. Can you guys please help me figure out what I am missing? I am using Cassandra version 1.2.1.
I have pasted the logs at http://pastebin.com/pmADWCYe

Exception in thread "main" java.lang.NullPointerException
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:71)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:66)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:61)
  at org.apache.cassandra.db.ColumnFamily.create(ColumnFamily.java:56)
  at org.apache.cassandra.db.RowMutation.add(RowMutation.java:183)
  at org.apache.cassandra.db.RowMutation.add(RowMutation.java:204)
  at ClientOnlyExample.testWriting(ClientOnlyExample.java:78)
  at ClientOnlyExample.main(ClientOnlyExample.java:135)

Regards, Rahul

This email and any attachments are confidential, and may be legally privileged and protected by copyright. If you are not the intended recipient dissemination or copying of this email is prohibited. If you have received this in error, please notify the sender by replying by email and then delete the email completely from your system. Any views or opinions are solely those of the sender. This communication is not intended to form a binding contract unless expressly indicated to the contrary and properly authorised. Any actions taken on the basis of this email are at the recipient's own risk.
Re: unsubscribe
On 02/17/2013 01:26 PM, puneet loya wrote: unsubscribe me please. Thank you if only directions were followed: http://hadonejob.com/images/full/102.jpg send to user-unsubscr...@cassandra.apache.org
Re: odd production issue today 1.1.4
There is always this old chestnut: http://wiki.apache.org/cassandra/FAQ#ubuntu_hangs

A - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 16/02/2013, at 8:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

With hyperthreading a core can show up as two or maybe even four physical system processors; this is something the kernel does.

On Fri, Feb 15, 2013 at 11:41 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

We ran into an issue today where the website became around 10 times slower. We found out node 5 out of our 6 nodes was hitting 2100% cpu (cat /proc/cpuinfo reveals a 16 processor machine). I am really not sure how to hit 2100% unless we had 21 processors. It bounces between 300% and 2100%, so I tried to do a thread dump and had to use -F, which then hit a NullPointerException in HotSpot :(. I copied off all my logs after restarting (should have done it before restarting). Any ideas what I could even look for as to what went wrong with this node?

Also, we know our astyanax for some reason is not set up properly yet, so we probably would not have seen an issue had we had all nodes in the seed list (which we changed today), as astyanax is supposed to be measuring time per request and changing which nodes it hits, but we know it only hits nodes in our seed list right now as we have not fixed that yet. Our astyanax was hitting 3,4,5,6 and did not have 1 and 2 in the seed list (we roll out a new version next Wed. with the new seed list including the last two, delaying the dynamic discovery config we need to look at).

Thanks, Dean

Commands I ran with jstack that didn't work out too well…

[cassandra@a5 ~]$ jstack -l 20907 > threads.txt
20907: Unable to open socket file: target process not responding or HotSpot VM not loaded
The -F option can be used when the target process is not responding
[cassandra@a5 ~]$ jstack -l -F 20907 > threads.txt
Attaching to process ID 20907, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.7-b02
java.lang.NullPointerException
  at sun.jvm.hotspot.oops.InstanceKlass.computeSubtypeOf(InstanceKlass.java:426)
  at sun.jvm.hotspot.oops.Klass.isSubtypeOf(Klass.java:137)
  at sun.jvm.hotspot.oops.Oop.isA(Oop.java:100)
  at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:93)
  at sun.jvm.hotspot.runtime.DeadlockDetector.print(DeadlockDetector.java:39)
  at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:52)
  at sun.jvm.hotspot.tools.StackTrace.run(StackTrace.java:45)
  at sun.jvm.hotspot.tools.JStack.run(JStack.java:60)
  at sun.jvm.hotspot.tools.Tool.start(Tool.java:221)
  at sun.jvm.hotspot.tools.JStack.main(JStack.java:86)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  at java.lang.reflect.Method.invoke(Method.java:597)
  at sun.tools.jstack.JStack.runJStackTool(JStack.java:118)
  at sun.tools.jstack.JStack.main(JStack.java:84)
[cassandra@a5 ~]$ java -version
java version "1.6.0_32"
Re: cassandra vs. mongodb quick question
If you have spinning disk and 1G networking and no virtual nodes, I would still say 300G to 500G is a soft limit. If you are using virtual nodes, SSD, a JBOD disk configuration or faster networking you may go higher. The limiting factors are the time it takes to repair, the time it takes to replace a node, and the memory considerations for 100's of millions of rows. If the performance of those operations is acceptable to you, then go crazy.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 16/02/2013, at 9:05 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

So I found out mongodb varies their node size from 1T to 42T per node depending on the profile. So if I was going to be writing a lot but rarely changing rows, could I also use cassandra with a per node size of +20T, or is that not advisable? Thanks, Dean
Re: can we pull rows out compressed from cassandra(lots of rows)?
No. The rows are uncompressed deep down in the IO stack. There is compression in the binary protocol http://www.datastax.com/dev/blog/binary-protocol https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol.spec;hb=refs/heads/cassandra-1.2 Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 16/02/2013, at 9:35 AM, Hiller, Dean dean.hil...@nrel.gov wrote: Thanks, Dean
unsubscribe
On Feb 17, 2013 10:27 AM, puneet loya puneetl...@gmail.com wrote: unsubscribe me please. Thank you
Re: unsubscribe
Please see the Mailing Lists section of the home page. http://cassandra.apache.org user-unsubscr...@cassandra.apache.org

From: James Wong jwong...@gmail.com Reply-To: user@cassandra.apache.org Date: Sunday, February 17, 2013 12:06 PM To: user@cassandra.apache.org Subject: unsubscribe

On Feb 17, 2013 10:27 AM, puneet loya puneetl...@gmail.com wrote: unsubscribe me please. Thank you
Re: Deleting old items
I'll email the docs people. I believe they are saying "use compaction throttling rather than this", not that this does nothing. Although I used it in the last month, on a machine with very little RAM, to limit compaction memory use.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 17/02/2013, at 7:05 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

"Can you point to the docs." http://www.datastax.com/docs/1.1/configuration/storage_configuration#max-compaction-threshold

And thanks about the rest of your answers, once again ;-). Alain

2013/2/16 aaron morton aa...@thelastpickle.com

"Is that a feature that could possibly be developed one day ?" No. Timestamps are essentially an internal implementation detail used to resolve different values for the same column.

"With min_compaction_level_threshold did you mean min_compaction_threshold ? If so, why should I do that, what are the advantage/inconvenient of reducing this value ?" Yes, min_compaction_threshold, my bad. If you have a wide row and delete a lot of values you will end up with a lot of tombstones. These may dramatically reduce read performance until they are purged. Reducing the compaction threshold makes compaction happen more frequently.

"Looking at the doc I saw that: max_compaction_threshold: Ignored in Cassandra 1.1 and later. How to ensure that I'll always keep a small amount of SSTables then ?" AFAIK it's not. There may be some confusion about the location of the settings in CLI vs CQL. Can you point to the docs?

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 13/02/2013, at 10:14 PM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi Aaron, once again thanks for this answer. "So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ?" No.
Why is there no way of deleting or getting data using the internal timestamp stored alongside any inserted column (as described here: http://www.datastax.com/docs/1.1/ddl/column_family#standard-columns)? Is that a feature that could possibly be developed one day? It could be useful to delete old data, or to bring just the last week of data to a dev cluster, for example.

With min_compaction_level_threshold did you mean min_compaction_threshold? If so, why should I do that; what are the advantages/inconveniences of reducing this value?

Looking at the doc I saw: "max_compaction_threshold: Ignored in Cassandra 1.1 and later." How do I ensure that I'll always keep a small number of SSTables then? Why is this deprecated?

Alain

2013/2/12 aaron morton aa...@thelastpickle.com

"So is it possible to delete all the data inserted in some CF between 2 dates or data older than 1 month ?" No. You need to issue row level deletes. If you don't know the row key you'll need to do range scans to locate them. If you are deleting parts of wide rows consider reducing the min_compaction_level_threshold on the CF to 2.

Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 12/02/2013, at 4:21 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi, I would like to know if there is a way to delete old/unused data easily. I know about TTL, but there are 2 limitations of TTL: - AFAIK, there is no TTL on counter columns - TTL needs to be defined at write time, so it's too late for data already inserted. I also could use a standard delete, but it seems inappropriate for such a massive operation. In some cases, I don't know the row key and would like to delete all the rows starting with, let's say, 1050#... Even better, I understand that columns are always inserted in C* with (name, value, timestamp). So is it possible to delete all the data inserted in some CF between 2 dates, or data older than 1 month? Alain
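Since there is no server-side "delete everything older than a month", the workaround in this thread is to scan rows, inspect each column's timestamp, and issue the deletes yourself. The selection step can be sketched like this (Python; columns modelled as the (name, value, timestamp) triples Alain mentions, with microsecond timestamps as Cassandra stores them):

```python
# Client-side selection of columns to delete, by internal write timestamp.
# Illustrative only: a real job would page through rows with range scans
# and issue batched deletes for the selected column names.

def columns_older_than(columns, cutoff_micros):
    """Return names of columns written before cutoff_micros."""
    return [name for (name, _value, ts) in columns if ts < cutoff_micros]

row = [("ev:1", "a", 1_000_000), ("ev:2", "b", 5_000_000), ("ev:3", "c", 9_000_000)]
print(columns_older_than(row, 6_000_000))  # ['ev:1', 'ev:2']
```

This only works if clients wrote sane timestamps; as Aaron says, timestamps are an internal conflict-resolution mechanism, not an indexed query dimension.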
Re: Is there any consolidated literature about Read/Write and Data Consistency in Cassandra ?
If you want the underlying ideas try the Dynamo paper, the Big Table paper and the original Cassandra paper from facebook. Start here http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 17/02/2013, at 7:40 AM, mateus mat...@tripleoxygen.net wrote: Like articles with tests and conclusions about it, and such, and not like the documentation in DataStax, or the Cassandra Books. Thank you.
Re: nodetool repair with vnodes
…so it seems to me that it is running on all vnodes ranges. Yes. Also, whatever node I launch the command on, only one node's log is moving, and it is always the same node. Not sure what you mean here. So, to me, it's like the nodetool repair command is always running on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. Is there anything I'm missing ? Look for messages with session completed in the log from the AntiEntropyService. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com

On 18/02/2013, at 12:51 AM, Marco Matarazzo marco.matara...@hexkeep.com wrote: Greetings. I'm trying to run nodetool repair on a Cassandra 1.2.1 cluster of 3 nodes with 256 vnodes each. On a pre-1.2 cluster I used to launch a nodetool repair on every node every 24hrs. Now I'm getting a different behavior, and I'm sure I'm missing something. What I see on the command line is: [2013-02-17 10:20:15,186] Starting repair command #1, repairing 768 ranges for keyspace goh_master [2013-02-17 10:48:13,401] Repair session 3d140e10-78e3-11e2-af53-d344dbdd69f5 for range (6556914650761469337,6580337080281832001] finished (…repeat the last line 767 times) …so it seems to me that it is running on all vnodes ranges. Also, whatever node I launch the command on, only one node's log is moving, and it is always the same node. So, to me, it's like the nodetool repair command is always running on the same single node and repairing everything. I'm sure I'm making some mistake, and I just can't find any clue in the documentation about what's wrong with my nodetool usage (if anything is wrong, btw). Is there anything I'm missing ? -- Marco Matarazzo
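The "768 ranges" figure lines up with the cluster shape described in this thread; a quick sanity check (a sketch, assuming 3 nodes with 256 vnodes each and RF 3):

```shell
# With vnodes, each of the 3 nodes owns 256 token ranges. RF=3 means every
# range is replicated on all three nodes, so a plain "nodetool repair"
# (no -pr) repairs every range in the cluster, from whichever node you run it.
nodes=3
vnodes=256
total_ranges=$((nodes * vnodes))
echo "nodetool repair (no -pr): $total_ranges ranges"  # matches "repairing 768 ranges"
# In principle, -pr should cover only this node's own primary ranges:
echo "nodetool repair -pr:      $vnodes ranges per node"
```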
Re: Nodetool doesn't shows two nodes
Hi, I've checked all the things Alain suggested and set up a fresh 2-node cluster, and I still get the same result: each node lists only itself. This time I made the following changes: - I set listen_address to the public DNS name. Internally, AWS's DNS will map this to the 10.x IP, so this should work correctly if I understand right. These are new EC2 instances, and I did not trust the configured hostnames and so on. - I opened all ports between nodes in the security group. - I kept the snitch at Ec2MultiRegionSnitch. This cluster is small now, but it will be very large and nationwide if I succeed and choose Cassandra for this purpose. Do I understand right that it is not possible to change this later, or at least not easy? - I ensured all of Alain's suggestions, for example that cluster_name is the same on all nodes. - I set the seed list to the public DNS name of the first node. This is identical on both nodes. - I checked Alain's suggestion about auto_bootstrap. The docs say this does not need to be set. Are the docs wrong? (I am looking at the DataStax 1.2 PDF docs.) Here is some more debugging evidence.
On node 1, the seed:

[root@ip-10-113-19-24 ~]# ifconfig | grep inet.addr
inet addr:10.113.19.24  Bcast:10.113.19.255  Mask:255.255.254.0

[root@ip-10-113-19-24 ~]# nodetool status
Datacenter: us-east
===================
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns    Host ID                               Rack
UN  23.22.204.201  20.97 KB  256     100.0%  4fadd4fd-c57c-4172-95aa-092368ba5743  1a

[root@ip-10-113-19-24 ~]# netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address          Foreign Address        State        PID/Program name
tcp        0      0 0.0.0.0:7199           0.0.0.0:*              LISTEN       1910/java
tcp        0      0 0.0.0.0:47298          0.0.0.0:*              LISTEN       1910/java
tcp        0      0 0.0.0.0:57030          0.0.0.0:*              LISTEN       1910/java
tcp        0      0 0.0.0.0:9160           0.0.0.0:*              LISTEN       1910/java
tcp        0      0 0.0.0.0:9042           0.0.0.0:*              LISTEN       1910/java
tcp        0      0 0.0.0.0:22             0.0.0.0:*              LISTEN       1231/sshd
tcp        0      0 10.113.19.24:7000      0.0.0.0:*              LISTEN       1910/java
tcp        0      1 10.113.19.24:38948     54.234.147.60:7000     SYN_SENT     1910/java
tcp        0      0 10.113.19.24:7000      10.113.19.24:45328     ESTABLISHED  1910/java
tcp        0      0 10.113.19.24:7000      10.114.205.157:47713   ESTABLISHED  1910/java
tcp        0      1 10.113.19.24:45597     23.22.204.201:7000     SYN_SENT     1910/java
tcp        0      0 10.113.19.24:45328     10.113.19.24:7000      ESTABLISHED  1910/java

And in the log:

INFO 20:58:12,472 Node /23.22.204.201 state jump to normal
INFO 20:58:12,482 Startup completed! Now serving reads.

Now, this looks similar to the problem before, with the private IP addresses being used sometimes and the public ones other times. By the way, the other node, whose internal IP address is 10.114.205.157, is connected to this seed node, as you can see. I think I could understand this problem if I knew which types of network connections I should expect to see in netstat, and what output I should expect to see in the log. Can someone with more experience tell me what is wrong/unexpected above? And am I working against Amazon's architecture by using IPs the way I do?
While I wait for an answer, I will shut down, delete all data, and reconfigure with public IP addresses explicitly, not DNS names :-) I have a feeling this is the problem. From within an Amazon EC2 server, resolving a public DNS name returns the private IP address. (However, I still feel unsure about what the right way to do this is, because I do not know if Cassandra will resolve the DNS name and end up trying to connect to a private IP that Cassandra is not listening on.) Thanks, - Boris

On Wed, Feb 13, 2013 at 10:37 AM, Boris Solovyov boris.solov...@gmail.com wrote: Thank you Alain. I will check the things you suggest and report my results. - Boris

On Wed, Feb 13, 2013 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Boris. I feel like I have made a beginner's mistake That's a horrible feeling :D. I'll try to help ;) cluster_name: 'TS' Are you sure you used the same name for both nodes ? I can connect to port 7000 You can check all the ports needed here http://www.datastax.com/docs/1.2/install/install_ami and open them in the security group once and for all, so you won't have to wonder about this anymore. listen_address: 10.145.232.190 INFO 19:36:32,710 Node /107.22.114.19 state jump to normal There is 10.145.232.190 defined as the listen address, your logs say that 107.22.114.19 joined the ring, and your second IP seems to be 23.21.11.193... When you stop an EC2 server, its internal IP may change. So I recommend you not to do that, but to restart them instead. Anyway you should
Re: Is C* common nickname for Cassandra?
Why do you feel that link is unprofessional? Just wondering. I actually quite like the abbreviation personally.

On Feb 17, 2013, at 1:37 PM, Boris Solovyov boris.solov...@gmail.com wrote: Thanks. I don't know if anyone cares about my opinion, but as a newcomer to the community, my feedback is that it is not needed. At best it confuses a newbie and makes him feel like an outsider. At worst it just looks totally unprofessional, like here: http://www.planetcassandra.org/blog/post/calling-all-apache-cassandra-speakers It is hard to form a good opinion of the Cassandra project when it is being discussed like that. Hopefully this is helpful constructive criticism and not just useless flamebait or trollbait. Boris

On Fri, Feb 8, 2013 at 11:51 AM, Tyler Hobbs ty...@datastax.com wrote: Yes, C* is short for Cassandra.

On Fri, Feb 8, 2013 at 10:43 AM, Boris Solovyov boris.solov...@gmail.com wrote: I see people refer to C* and I assume it means Cassandra, but just wanted to check for sure. In case it is something else and I missed it :) Do I understand right? -- Tyler Hobbs DataStax http://datastax.com/
Re: Is C* common nickname for Cassandra?
It is hard to say, really. I guess it just feels not very serious, overly casual, which means not treating the project with respect? I believe that if you want something treated with respect you must demonstrate how seriously you take it yourself. I am sure this is personal opinion only, but perhaps it is shared by others. An Enterprise Pointy-Haired Boss might make a purchase decision on this criterion instead of on technical merits. You know they make decisions based on how pretty the project logo is half the time :-) Hope this helps Boris

On Sun, Feb 17, 2013 at 4:42 PM, Michael Kjellman mkjell...@barracuda.com wrote: Why do you feel that link is unprofessional? Just wondering. I actually quite like the abbreviation personally.

On Feb 17, 2013, at 1:37 PM, Boris Solovyov boris.solov...@gmail.com wrote: Thanks. I don't know if anyone cares about my opinion, but as a newcomer to the community, my feedback is that it is not needed. At best it confuses a newbie and makes him feel like an outsider. At worst it just looks totally unprofessional, like here: http://www.planetcassandra.org/blog/post/calling-all-apache-cassandra-speakers It is hard to form a good opinion of the Cassandra project when it is being discussed like that. Hopefully this is helpful constructive criticism and not just useless flamebait or trollbait. Boris

On Fri, Feb 8, 2013 at 11:51 AM, Tyler Hobbs ty...@datastax.com wrote: Yes, C* is short for Cassandra.

On Fri, Feb 8, 2013 at 10:43 AM, Boris Solovyov boris.solov...@gmail.com wrote: I see people refer to C* and I assume it means Cassandra, but just wanted to check for sure. In case it is something else and I missed it :) Do I understand right? -- Tyler Hobbs DataStax http://datastax.com/
Re: Nodetool doesn't shows two nodes
No, it doesn't work; same thing: both nodes seem to just exist solo and I have 2 single-node clusters :-( OK, so now I am confused and hope the list will help me out. To understand what is wrong, I think I need to know what happens during node bootstrap and when a node joins the ring. With whom does the node communicate, and on which address? What information does it exchange? What happens then? What does this process look like normally? I have read all the docs, several times, and don't think I missed it, so maybe it is not explained there clearly. I will look again, and look at the source code next. - Boris

On Sun, Feb 17, 2013 at 4:48 PM, Boris Solovyov boris.solov...@gmail.com wrote: Aha! I think I might have a breakthrough. I tried setting the public IP in listen_address (and therefore in broadcast_address, because as I understand it, it inherits the value if commented out), and in the seeds list. The node fails to start, because Cassandra cannot bind to the public IP address: it does not exist on the box. Of course! This is why I cannot see it in ifconfig. SO, my next theory: - set listen_address to the private IP - set broadcast_address to the public IP, which tells other nodes how to connect - set seeds to the public IP I will try this next and continue to flood your inbox with my stream-of-consciousness trial and error ;-)
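For reference, the theory in that last message would look like this in cassandra.yaml (a sketch; the IPs are taken from the examples earlier in this thread). Note that with Ec2MultiRegionSnitch, broadcast_address is normally set to the instance's public IP for you:

```yaml
# cassandra.yaml sketch for node 1 (IPs from this thread's examples)
listen_address: 10.113.19.24       # private IP - the only one bindable on the box
broadcast_address: 23.22.204.201   # public IP - what remote nodes connect to
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "23.22.204.201"
```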
Re: nodetool repair with vnodes
So, to me, it's like the nodetool repair command is always running on the same single node and repairing everything. If you use nodetool repair without the -pr flag in your setup (3 nodes and I assume RF 3) it will repair all token ranges in the cluster. That's correct, 3 nodes and RF 3. Sorry for not specifying it at the beginning. So, is running it periodically on just one node enough for cluster maintenance ? Does this depend on the fact that every vnode's data is also replicated on the previous and next vnodes, so that this particular setup makes it enough because it covers every physical node? Also: running it with -pr outputs: [2013-02-17 12:29:25,293] Nothing to repair for keyspace 'system' [2013-02-17 12:29:25,301] Starting repair command #2, repairing 1 ranges for keyspace keyspace_test [2013-02-17 12:29:28,028] Repair session 487d0650-78f5-11e2-a73a-2f5b109ee83c for range (-9177680845984855691,-9171525326632276709] finished [2013-02-17 12:29:28,028] Repair command #2 finished … which, as far as I can understand, works on the first vnode of the specified node, or so it seems from the output range. Am I right? Is there a way to run it only for all vnodes on a single physical node ? Thank you! -- Marco Matarazzo
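The pre-vnode maintenance pattern being described can be sketched as a dry run (hostnames here are made up; whether -pr actually covers all of a node's primary ranges in 1.2.1 is exactly the open question in this message):

```shell
# Dry run of the classic maintenance loop: run "repair -pr" on each node so
# that, across the whole loop, every primary range is repaired exactly once.
# Drop the "echo" to actually invoke nodetool against each host.
hosts="cass1.example.com cass2.example.com cass3.example.com"
for h in $hosts; do
  echo nodetool -h "$h" repair -pr
done
```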
Re: Nodetool doesn't shows two nodes
OK. I got it. I realized that storage_port wasn't actually open between the nodes, because it is using the public IP. (I did find this information in the docs after looking more... it is in the section on types of snitches. It explains everything I found by trial and error.) After opening port 7000 to all IP addresses, the cluster boots OK and the two nodes see each other. Now I have the happy result. But my nodes are wide open to the entire internet on port 7000. This is a serious problem; this obviously can't be put into production. I definitely need cross-continent deployment. A single-AZ or single-region deployment is not going to be enough. How do people solve this in practice?
Re: Size Tiered - Leveled Compaction
We doubled the SSTable size to 10M. It still generates a lot of SSTables and we don't see much difference in the read latency. We are able to finish the compactions after repair within several hours. We will increase the SSTable size again if we feel the number of SSTables hurts performance.

----- Original Message ----- From: Mike mthero...@yahoo.com To: user@cassandra.apache.org Sent: Sunday, February 17, 2013 4:50:40 AM Subject: Re: Size Tiered - Leveled Compaction

Hello Wei, First, thanks for this response. Out of curiosity, what SSTable size did you choose for your use case, and what made you decide on that number? Thanks, -Mike

On 2/14/2013 3:51 PM, Wei Zhu wrote: I haven't tried to switch compaction strategy. We started with LCS. For us, after massive data imports (5000 w/second for 6 days), the first repair is painful since there is quite some data inconsistency. For 150G nodes, repair brought in about 30G and created thousands of pending compactions. It took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. System performance degrades during that time since reads could go to more SSTables; we see 20 SSTable lookups for one read. (We tried everything we could and couldn't speed it up. I think it's single threaded, and it's not recommended to turn on multithreaded compaction. We even tried that, it didn't help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. Haven't upgraded yet, hope it works :) http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 Since our cluster is not write intensive, only 100 w/second, I don't see any pending compactions during regular operation. One thing worth mentioning is the size of the SSTable: the default is 5M, which is kind of small for a 200G (all in one CF) data set, and we are on SSD. That is more than 150K files in one directory. (200G/5M = 40K SSTables, and each SSTable creates 4 files on disk.) You might want to watch that and decide the SSTable size.
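Wei's file-count math can be sanity-checked quickly (a sketch; 200G taken as 204800MB, and 4 files per SSTable as in 1.1.x):

```shell
# Estimate on-disk file count for a ~200G CF under LCS at various SSTable sizes.
data_mb=204800             # ~200G of data in one CF
for sstable_mb in 5 10; do
  sstables=$((data_mb / sstable_mb))
  files=$((sstables * 4))  # each SSTable is ~4 files on disk in 1.1.x
  echo "${sstable_mb}MB sstables: ${sstables} tables, ~${files} files"
done
```

With the default 5M size that is ~163K files (hence "more than 150K"); doubling to 10M halves it.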
By the way, there is no concept of a Major compaction for LCS. Just for fun, you can look at a file called $CFName.json in your data directory; it tells you the SSTable distribution among the different levels. -Wei

From: Charles Brophy cbro...@zulily.com To: user@cassandra.apache.org Sent: Thursday, February 14, 2013 8:29 AM Subject: Re: Size Tiered - Leveled Compaction

I second these questions: we've been looking into changing some of our CFs to use leveled compaction as well. If anybody here has the wisdom to answer them it would be of wonderful help. Thanks Charles

On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote: Hello, I'm investigating the transition of some of our column families from Size Tiered to Leveled Compaction. I believe we have some high-read-load column families that would benefit tremendously. I've stood up a test DB node to investigate the transition. I successfully alter the column family, and I immediately notice a large number (1000+) of pending compaction tasks appear, but no compactions get executed. I tried running nodetool sstableupgrade on the column family, and the compaction tasks don't move. I also notice no changes to the size and distribution of the existing SSTables. I then run a major compaction on the column family. All pending compaction tasks get run, and the SSTables have a distribution that I would expect from LeveledCompaction (lots and lots of 10MB files). A couple of questions: 1) Is a major compaction required to transition from size-tiered to leveled compaction? 2) Are major compactions as much of a concern for LeveledCompaction as they are for Size Tiered? All the documentation I found concerning transitioning from Size Tiered to Leveled compaction discusses the alter table CQL command, but I haven't found too much on what else needs to be done after the schema change. I did these tests with Cassandra 1.1.9. Thanks, -Mike
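For reference, the alter-table step Mike mentions looks like this in CQL3 as of Cassandra 1.2 (the table name is hypothetical; 1.1 used the older compaction_strategy_class / compaction_strategy_options form instead of the map syntax):

```sql
-- Switch a table to LCS with 10MB SSTables (Cassandra 1.2 CQL3 map syntax).
ALTER TABLE my_table
  WITH compaction = { 'class' : 'LeveledCompactionStrategy',
                      'sstable_size_in_mb' : 10 };
```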
Re: Nodetool doesn't shows two nodes
This is something that I found while using the multi-region snitch - it uses public IPs for communication. See the original ticket here: https://issues.apache.org/jira/browse/CASSANDRA-2452. It'd be nice if it used the private IPs to communicate with nodes that are in the same region as itself, but I do not believe this is the case. Be aware that you will be charged for external data transfer even for nodes in the same region, because the traffic will not fall under the free (same-AZ) or reduced (inter-AZ) tiers. If you continue using this snitch in the meantime, it is not necessary (or recommended) to have those ports open to 0.0.0.0/0. You'll simply need to add the public IPs of your C* servers to the correct security group(s) to allow access. There's something else that's a little strange about the EC2 snitches: us-east-1 is (incorrectly) represented as the datacenter us-east. Other regions are recognized and named properly (us-west-2, for example). This is kind-of covered in the ticket here: https://issues.apache.org/jira/browse/CASSANDRA-4026 I wish it could be fixed properly. Good luck!

On 17 February 2013 16:16, Boris Solovyov boris.solov...@gmail.com wrote: OK. I got it. I realized that storage_port wasn't actually open between the nodes, because it is using the public IP. (I did find this information in the docs after looking more... it is in the section on types of snitches. It explains everything I found by trial and error.) After opening port 7000 to all IP addresses, the cluster boots OK and the two nodes see each other. Now I have the happy result. But my nodes are wide open to the entire internet on port 7000. This is a serious problem; this obviously can't be put into production. I definitely need cross-continent deployment. A single-AZ or single-region deployment is not going to be enough. How do people solve this in practice?