Re: Completely removing a node from the cluster

2011-08-23 Thread Jonathan Colby
I ran into this.  I also tried load_ring_state=false, which did not help either.
The way I got through this was to stop the entire cluster and start the nodes
one-by-one.

I realize this is not a practical solution for everyone, but if you can afford 
to stop the cluster for a few minutes, it's worth a try.
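
For reference, this is roughly how I passed the flag during the restarts (assuming
the stock conf/cassandra-env.sh; adjust the path for your packaging):

    # added to conf/cassandra-env.sh before restarting each node
    JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"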


On Aug 23, 2011, at 9:26 AM, aaron morton wrote:

 I'm running low on ideas for this one. Anyone else ? 
 
 If the phantom node is not listed in the ring, other nodes should not be 
 storing hints for it. You can see what nodes they are storing hints for via 
 JConsole. 
 
 You can try a rolling restart passing the JVM opt 
 -Dcassandra.load_ring_state=false. However, if the phantom node is being passed 
 around in the gossip state it will probably just come back again. 
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
 
 Could this ghost node be causing my hints column family to grow to this 
 size?  I also crash after about 24 hours due to commit log growth taking up 
 all the drive space.  A manual nodetool flush keeps it under control though.
 
 
   Column Family: HintsColumnFamily
   SSTable count: 6
   Space used (live): 666480352
   Space used (total): 666480352
   Number of Keys (estimate): 768
   Memtable Columns Count: 1043
   Memtable Data Size: 461773
   Memtable Switch Count: 3
   Read Count: 38
   Read Latency: 131.289 ms.
   Write Count: 582108
   Write Latency: 0.019 ms.
   Pending Tasks: 0
   Key cache capacity: 7
   Key cache size: 6
   Key cache hit rate: 0.8334
   Row cache: disabled
   Compacted row minimum size: 2816160
   Compacted row maximum size: 386857368
   Compacted row mean size: 120432714
 
 Is there a way for me to manually remove this dead node?
 
 -Original Message-
 From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] 
 Sent: Sunday, August 21, 2011 9:09 PM
 To: user@cassandra.apache.org
 Subject: RE: Completely removing a node from the cluster
 
 It's been at least 4 days now.
 
 -Original Message-
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Sunday, August 21, 2011 3:16 PM
 To: user@cassandra.apache.org
 Subject: Re: Completely removing a node from the cluster
 
 I see the mistake I made about ring: it gets the endpoint list from the same 
 place but uses the tokens to drive the whole process. 
 
 I'm guessing here, don't have time to check all the code. But there is a 3 
 day timeout in the gossip system. Not sure if it applies in this case. 
 
 Anyone know ?
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
 
 Both .2 and .3 report the same thing from the MBean: UnreachableNodes is an empty 
 collection, and LiveNodes still lists all 3 nodes:
 192.168.20.2
 192.168.20.3
 192.168.20.1
 
 The removetoken was done a few days ago, and I believe the remove was done 
 from .2
 
 Here is what the nodetool ring output looks like; not sure why I get that token 
 on the empty first line either:
 Address         DC          Rack   Status  State   Load       Owns     Token
                                                                         85070591730234615865843651857942052864
 192.168.20.2    datacenter1 rack1  Up      Normal  79.53 GB   50.00%   0
 192.168.20.3    datacenter1 rack1  Up      Normal  42.63 GB   50.00%   85070591730234615865843651857942052864
 
 Yes, both nodes show the same thing when doing a describe cluster, that .1 
 is unreachable.
 
 
 -Original Message-
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Sunday, August 21, 2011 4:23 AM
 To: user@cassandra.apache.org
 Subject: Re: Completely removing a node from the cluster
 
 Unreachable nodes either did not respond to the message or were known to 
 be down and were not sent a message. 
 The way the node lists are obtained for the ring command and describe 
 cluster are the same. So it's a bit odd. 
 
 Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? 
 What do the LiveNodes and UnreachableNodes attributes say? 
 
 Also how long ago did you remove the token and on which machine? Do both 
 20.2 and 20.3 think 20.1 is still around ? 
 
 Cheers
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
 
 I'm on 0.8.4
 
 I have removed a dead node from the cluster using nodetool removetoken 
 command, and moved one of the remaining nodes to rebalance the tokens.  
 Everything looks fine when I run nodetool ring now, as 

Re: Re: Urgent:!! Re: Need to maintenance on a cassandra node, are there problems with this process

2011-08-19 Thread jonathan . colby

Hi -

From what I understand, Peter's recommendation should work for you. Both  
approaches have worked for me. No need to copy anything by hand on the new node.  
Bootstrap/repair does that for you. From the Wiki:



If a node goes down entirely, then you have two options:

(Recommended approach) Bring up the replacement node with a new IP address,  
Set initial token to (failure node's token) - 1 and AutoBootstrap set to  
true in cassandra.yaml (storage-conf.xml for 0.6 or earlier). This will  
place the replacement node in front of the failure node. Then the bootstrap  
process begins. While this process runs, the node will not receive reads  
until finished. Once this process is finished on the replacement node, run  
nodetool removetoken once, supplying the token of the dead node, and  
nodetool cleanup on each node. You can obtain the dead node's token by  
running nodetool ring on any live node, unless there was some kind of  
outage, and the others came up but not the down one -- in that case, you  
can retrieve the token from the live nodes' system tables.


(Alternative approach) Bring up a replacement node with the same IP and  
token as the old, and run nodetool repair. Until the repair process is  
complete, clients reading only from this node may get no data back. Using a  
higher ConsistencyLevel on reads will avoid this.
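
As a concrete sketch of the recommended approach (values in angle brackets are
placeholders - use the dead node's actual token from nodetool ring):

    # cassandra.yaml on the replacement node (storage-conf.xml on 0.6 or earlier):
    #   auto_bootstrap: true
    #   initial_token: <dead node's token minus 1>
    # start the node and let the bootstrap finish, then:
    nodetool -h <any live node> removetoken <dead node's token>
    nodetool -h <each node> cleanup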


On , Anand Somani meatfor...@gmail.com wrote:
Let me be specific on lost data - we lost a replica; the other 2 nodes  
have replicas.


I am running read/write at quorum. At this point I have turned off my  
clients from talking to this node. So if that is the case I can  
potentially just nodetool repair (without changing IP). But would it be  
better if I copied over the data/mykeyspace from another replica and then  
run repair?



On Fri, Aug 19, 2011 at 11:20 AM, Peter Schuller  
peter.schul...@infidyne.com wrote:

  ok, so we just lost the data on that node. are building the raid on it, but
  once it is up what is the best way to bring it back in the cluster

You're saying the raid failed and data is gone?

  just let it come up and run nodetool repair
  copy data from another node and then run nodetool repair,

  do I still need to run repair immediately if I copy the data? Want to
  schedule repair for later during non peak hours?

If data is gone, the safe way is to have it re-join the cluster:

http://wiki.apache.org/cassandra/Operations#Handling_failure

But note that in your case, since you've lost data (if I understand
you), it's effectively a completely new node. That means you either
want to switch its IP address and go for the recommended approach,
or do the other option but that WILL mean the node is serving reads
with incorrect data, violating the consistency. Depending on your
application, this may or may not be the case.

Unless it's a major problem for you, I suggest bringing it back in
with a new IP address and make it be treated like a completely fresh
replacement node. Probably decreases the risk of mistakes happening.

As for the other stuff about repair in the e-mail you pasted; periodic
repairs are part of regular cluster maintenance. See:

http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair

--
/ Peter Schuller (@scode on twitter)








upgrade from 0.7.6 to 0.8.4

2011-08-16 Thread Jonathan Colby
Hi - sorry if this was asked before but I couldn't find any answers about it.

Is the upgrade path from 0.7.6 to 0.8.4 possible via a simple rolling restart?  
  

Are nodes with these different versions compatible - i.e., can one node be 
upgraded in order to see if we run into any problems before upgrading the 
others?




Re: Re: Cassandra start/stop scripts

2011-07-27 Thread jonathan . colby

A simple kill without -9 should work. Have you tried that?
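
If you want something a bit friendlier to scripts than grepping ps, a minimal
sketch (the pid file location is arbitrary, and I'm going from memory on the
launcher's -p flag):

    # start, writing a pid file we can use later
    bin/cassandra -p /var/run/cassandra.pid
    # stop: a plain SIGTERM, no -9
    kill $(cat /var/run/cassandra.pid)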

On , Jason Pell jasonmp...@gmail.com wrote:
Check out the rpm packages from Cassandra they have init.d scripts that  
work very nicely, there are debs as well for ubuntu



Sent from my iPhone



On Jul 27, 2011, at 3:19, Priyanka priya...@gmail.com wrote:





I do the same way...



On Tue, Jul 26, 2011 at 1:07 PM, mcasandra wrote:



I need to write cassandra start/stop script. Currently I run cassandra  
to start and kill -9 to stop.



Is this the best way? kill -9 doesn't sound right :) Wondering how others  
do it.












eliminate need to repair by using column TTL??

2011-07-22 Thread jonathan . colby
One of the main reasons for regularly running repair is to make sure  
deletes are propagated in the cluster, ie, data is not resurrected if a  
node never received the delete call.


And repair-on-read takes care of repairing inconsistencies on-the-fly.

So if I were to set a universal TTL on all columns - so everything would  
only live for a certain age, would I be able to get away without having to  
do regular repairs with nodetool?


I realize this scenario would not be applicable for everyone, but our data  
model would allow us to do this.


So could this be an alternative to running the (resource-intensive,  
long-running) repairs with nodetool?


Thanks.


Re: Re: eliminate need to repair by using column TTL??

2011-07-22 Thread jonathan . colby
good points Aaron. I realize now how expensive repair on reads are. I'm  
going to keep doing repairs regularly but still have a max TTL on all  
columns to make sure we don't have really old data we no longer need  
getting buried in the cluster.


On , aaron morton aa...@thelastpickle.com wrote:
Read repair will only repair data that is read on the nodes that are up  
at that time, and does not guarantee that any changes it detects will be  
written back to the nodes. The diff mutations are async fire and forget  
messages which may go missing or be dropped or ignored by the recipient  
just like any other message.

Also getting hit with a bunch of read repair operations is pretty  
painful. The normal read runs, the coordinator detects the digest  
mis-match, the read runs again from all nodes and they all have to return  
their full data (no digests this time), the coordinator detects the  
diffs, mutations are sent back to each node that needs them. All this  
happens sync to the read request when the CL > ONE. That's 2 reads with  
more network IO and up to RF mutations.

The delete thing is important but repair also reduces the chance of reads  
getting hit with RR and gives me confidence when it's necessary to nuke a  
bad node.

Your plan may work but it feels risky to me. You may end up with worse  
read performance and unpleasant emotions if you ever have to nuke a node.  
Others may disagree.

Not ignoring the fact the repair can take a long time, fail, hurt  
performance etc. There are plans to improve it though.

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 22 Jul 2011, at 19:55, jonathan.co...@gmail.com wrote:

 One of the main reasons for regularly running repair is to make sure  
 deletes are propagated in the cluster, ie, data is not resurrected if a  
 node never received the delete call.

 And repair-on-read takes care of repairing inconsistencies on-the-fly.

 So if I were to set a universal TTL on all columns - so everything  
 would only live for a certain age, would I be able to get away without  
 having to do regular repairs with nodetool?

 I realize this scenario would not be applicable for everyone, but our  
 data model would allow us to do this.

 So could this be an alternative to running the (resource-intensive,  
 long-running) repairs with nodetool?

 Thanks.






Repair question - why is so much data transferred?

2011-07-21 Thread Jonathan Colby
I regularly run repair on my cassandra cluster.   However, I often see that 
during the repair operation very large amounts of data are transferred to other 
nodes.

My question is, if only some data is out of sync, why are entire Data files 
being transferred?

   /var/lib/cassandra/data/DFS/main-f-893-Data.db sections=2602 
progress=22942842880/63149903764 - 36%
   /var/lib/cassandra/data/DFS/main-f-946-Data.db sections=1437 
progress=0/65991601 - 0%
   /var/lib/cassandra/data/DFS/main-f-907-Data.db sections=2602 
progress=0/1635822909 - 0%

My guess is that since data in the Data files is immutable, it needs to copy 
the entire file over, then I assume a compaction would take place to 
consolidate the data.  But that's just my wild guess.

Can anyone explain this behavior?




Re: Re: Repair question - why is so much data transferred?

2011-07-21 Thread jonathan . colby

from ticket 2818:
One (reasonably simple) proposition to fix this would be to have repair  
schedule validation compactions across nodes one by one (ie, one CF/range  
at a time), waiting for all nodes to return their tree before submitting  
the next request. Then on each node, we should make sure that the node will  
start the validation compaction as soon as requested. For that, we probably  
want to have a specific executor for validation compaction


.. This was the way I thought repair worked.

Anyway, in our case, we only have one CF, so I'm not sure if both issues  
apply to my situation.


Thanks. Looking forward to the release where these 2 things are fixed.

On , Jonathan Ellis jbel...@gmail.com wrote:

On Thu, Jul 21, 2011 at 9:14 AM, Jonathan Colby
jonathan.co...@gmail.com wrote:

  I regularly run repair on my cassandra cluster. However, I often see  
 that during the repair operation very large amounts of data are  
 transferred to other nodes.

https://issues.apache.org/jira/browse/CASSANDRA-2280
https://issues.apache.org/jira/browse/CASSANDRA-2816

  My question is, if only some data is out of sync, why are entire Data  
 files being transferred?

Repair streams ranges of files as a unit (which becomes a new file on
the target node) rather than using the normal write path.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com




Re: Decorator Algorithm

2011-06-24 Thread Jonathan Colby
thanks guys. That clears things up.
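
For anyone else curious, a back-of-the-envelope approximation of what the
RandomPartitioner does with a key (md5 the raw key bytes, read the digest as a
signed 128-bit integer, take the absolute value as the token) - illustrative
only, the real code path is in IPartitioner and its helpers:

    python -c "import hashlib, sys; h = int(hashlib.md5(sys.argv[1]).hexdigest(), 16); print h if h < 2**127 else 2**128 - h" mykey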

On Jun 24, 2011, at 4:53 AM, Maki Watanabe wrote:

 A little addendum
 
 Key := Your data to identify a row
 Token := Index on the ring calculated from Key. The calculation is
 defined by the partitioner.
 
 You can lookup responsible nodes (endpoints) for a specific key with
 JMX getNaturalEndpoints interface.
 
 maki
 
 
 2011/6/24 aaron morton aa...@thelastpickle.com:
 Various places in the code call IPartitioner.decorateKey() which returns a 
 DecoratedKey<T> which contains both the original key and the Token<T>.
 
 The RandomPartitioner uses md5 to hash the key ByteBuffer and create a 
 BigInteger. The OPP converts the key into a utf8-encoded String.
 
 Using the token to find which endpoints contain replicas is done by the 
 AbstractReplicationStrategy.calculateNaturalEndpoints() implementations.
 
 Does that help?
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23 Jun 2011, at 19:58, Jonathan Colby wrote:
 
 Hi -
 
 I'd like to understand more how the token is hashed with the key to 
 determine on which node the data is stored - called decorating in cassandra 
 speak.
 
 Can anyone share any documentation on this or describe this more in detail? 
   Yes, I could look at the code, but I was hoping to be able to read more 
 about how it works first.
 
 thanks.
 
 
 
 
 
 -- 
 w3m



Decorator Algorithm

2011-06-23 Thread Jonathan Colby
Hi -

I'd like to understand more how the token is hashed with the key to determine 
on which node the data is stored - called decorating in cassandra speak.

Can anyone share any documentation on this or describe this more in detail?   
Yes, I could look at the code, but I was hoping to be able to read more about 
how it works first.

thanks.

Re: insufficient space to compact even the two smallest files, aborting

2011-06-23 Thread Jonathan Colby
A compaction will be triggered when the min number of same-sized SSTable files 
is found.   So what's actually the purpose of the max part of the 
threshold?


On Jun 23, 2011, at 12:55 AM, aaron morton wrote:

 Setting them to 2 and 2 means compaction can only ever compact 2 files at 
 time, so it will be worse off.
 
 Lets the try following:
 
 - restore the compactions settings to the default 4 and 32
 - run `ls -lah` in the data dir and grab the output
 - run `nodetool flush` this will trigger minor compaction once the memtables 
 have been flushed
 - check the logs for messages from 'CompactionManager'
 - when done grab the output from  `ls -lah` again. 
 
 Hope that helps. 
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23 Jun 2011, at 02:04, Héctor Izquierdo Seliva wrote:
 
 Hi All. I set the compaction threshold at minimum 2, maximum 2 and try
 to run compact, but it's not doing anything. There are over 69 sstables
 now, read performance is horrible, and it's taking an insane amount of
 space. Maybe I don't quite get how the new per bucket stuff works, but I
 think this is not normal behaviour.
 
 El lun, 13-06-2011 a las 10:32 -0500, Jonathan Ellis escribió:
 As Terje already said in this thread, the threshold is per bucket
 (group of similarly sized sstables) not per CF.
 
 2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com:
 I was already way over the minimum. There were 12 sstables. Also, is
 there any reason why scrub got stuck? I did not see anything in the
 logs. Via jmx I saw that the scrubbed bytes were equal to one of the
 sstables size, and it stuck there for a couple hours .
 
 El lun, 13-06-2011 a las 22:55 +0900, Terje Marthinussen escribió:
 That most likely happened just because after scrub you had new files
 and got over the 4 file minimum limit.
 
 https://issues.apache.org/jira/browse/CASSANDRA-2697
 
 Is the bug report.
 
 
 
 
 
 
 
 
 
 
 



simple question about merged SSTable sizes

2011-06-22 Thread Jonathan Colby

The way compaction works,  x same-sized files are merged into a new SSTable.  
This repeats itself and the SSTables get bigger and bigger.

So what is the upper limit?? If you are not deleting stuff fast enough, 
wouldn't the SSTable sizes grow indefinitely?

I ask because we have some rather large SSTable files (80-100 GB) and I'm 
starting to worry about future compactions.

Second, compacting such large files is an IO killer.  What can be tuned other 
than compaction_threshold to help optimize this and prevent the files from 
getting too big?

Thanks!

Re: simple question about merged SSTable sizes

2011-06-22 Thread Jonathan Colby
Thanks for the explanation.  I'm still a bit skeptical.   

So if you really needed to control the maximum size of compacted SSTables,  you 
need to delete data at such a rate that the new files created by compaction are 
less than or equal to the sum of the segments being merged.

Is anyone else running into really large compacted SSTables that gave you 
trouble with hard disk capacity?  How did you deal with it?   

We have 1 TB disks in our nodes, but keeping in mind we need to have at least 
50% for the worst case compaction scenario I'm still a bit worried that one day 
we're going to hit a dead end.



On Jun 22, 2011, at 6:50 PM, Eric tamme wrote:

 On Wed, Jun 22, 2011 at 12:35 PM, Jonathan Colby
 jonathan.co...@gmail.com wrote:
 
 The way compaction works,  x same-sized files are merged into a new 
 SSTable.  This repeats itself and the SSTable get bigger and bigger.
 
 So what is the upper limit?? If you are not deleting stuff fast enough, 
 wouldn't the SSTable sizes grow indefinitely?
 
 I ask because we have some rather large SSTable files (80-100 GB) and I'm 
 starting to worry about future compactions.
 
 Second, compacting such large files is an IO killer.What can be tuned 
 other than compaction_threshold to help optimize this and prevent the files 
 from getting too big?
 
 Thanks!
 
 
 The compaction is an iterative process that first compacts uncompacted
 SSTables and removes tombstones etc.  This compaction takes multiple
 files and merges them into one SSTable.  This process repeats until
 you have compaction_threshold=X number of similarly sized SSTables,
 then those will get re-compacted (merged) together.  The number and
 size of SSTables that you have as a result of a flush is tuned by max
 size, or records, or time.  Contrary to what you might believe, having
 fewer larger SSTables reduces IO compared to compacting many small
 SSTables.  Also the merge operation of previously compacted SSTables
 is relatively fast.
 
 As far as I know, cassandra will continue compacting SSTables into an
 indefinitely larger sized SSTable.  The tunable side of things is for
 adjusting when to flush memtable to SSTable, and the number of
 SSTables of similar size that must be present to execute a compaction.
 
 -Eric



Re: simple question about merged SSTable sizes

2011-06-22 Thread Jonathan Colby
So the take-away is try to avoid major compactions at all costs!   Thanks Ed 
and Eric.

On Jun 22, 2011, at 7:00 PM, Edward Capriolo wrote:

 Yes, if you are not deleting fast enough they will grow. This is not 
 specifically a cassandra problem; /var/log/messages has the same issue. 
 
 There is a JIRA ticket about having a maximum size for SSTables, so they 
 always stay manageable.
 
 You fall into a small trap when you force major compaction in that many small 
 tables turn into one big one; from there it is hard to get back to many 
 smaller ones again. The other side of the coin: if you do not major compact 
 you can end up with much more disk usage than live data (IE a large % of disk 
 is overwrites and tombstones).
 
 You can tune the compaction rate now so compaction does not kill your IO. 
 Generally I think avoiding really large SSTables is the best way to go. Scale 
 out and avoid very large SSTables/node if possible.
 
 Edward
 
 
 On Wed, Jun 22, 2011 at 12:35 PM, Jonathan Colby jonathan.co...@gmail.com 
 wrote:
 
 The way compaction works,  x same-sized files are merged into a new 
 SSTable.  This repeats itself and the SSTable get bigger and bigger.
 
 So what is the upper limit?? If you are not deleting stuff fast enough, 
 wouldn't the SSTable sizes grow indefinitely?
 
 I ask because we have some rather large SSTable files (80-100 GB) and I'm 
 starting to worry about future compactions.
 
 Second, compacting such large files is an IO killer.What can be tuned 
 other than compaction_threshold to help optimize this and prevent the files 
 from getting too big?
 
 Thanks!
 



Re: simple question about merged SSTable sizes

2011-06-22 Thread Jonathan Colby
Thanks Ryan.  Done that : )   1 TB is the striped size.  We might look into 
bigger disks for our blades.
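
(For anyone setting that up fresh, striping two data disks with mdadm looks
roughly like this - device names, filesystem and mount point are placeholders:)

    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mount /dev/md0 /var/lib/cassandra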

On Jun 22, 2011, at 7:09 PM, Ryan King wrote:

 On Wed, Jun 22, 2011 at 10:00 AM, Jonathan Colby
 jonathan.co...@gmail.com wrote:
 Thanks for the explanation.  I'm still a bit skeptical.
 
 So if you really needed to control the maximum size of compacted SSTables,  
 you need to delete data at such a rate that the new files created by 
 compaction are less than or equal to the sum of the segments being merged.
 
 Is anyone else running into really large compacted SSTables that gave you 
 trouble with hard disk capacity?  How did you deal with it?
 
 We have 1 TB disks in our nodes, but keeping in mind we need to have at 
 least 50% for the worst case compaction scenario I'm still a bit worried 
 that one day we're going to hit a dead end.
 
 You should stripe those disks together with RAID-0.
 
 -ryan



Re: simple question about merged SSTable sizes

2011-06-22 Thread Jonathan Colby
Awesome tip on TTL.  We can really use this as a catch-all to make sure all 
columns are purged based on time.  Fits our use-case well.  I forgot this 
feature existed.


On Jun 22, 2011, at 7:11 PM, Eric tamme wrote:

 Second, compacting such large files is an IO killer.What can be tuned
 other than compaction_threshold to help optimize this and prevent the files
 from getting too big?
 
 Thanks!
 
 
 
 Just a personal implementation note - I make heavy use of column TTL,
 so I have very specifically tuned cassandra to having a pretty
 constant max disk usage based on my data insertion rate, the TTL, the
 memtable flush threshold, and min compaction threshold.  My data
 basically lives for 7 days and depending on where it is in the
 compaction cycle goes from 130 gigs per node up to 160gigs per node.
 
 If setting TTL is an option for you, It is one way to auto purge data
 and keep overall size in check.
 
 -Eric



Re: New web client future API

2011-06-20 Thread Jonathan Colby
I just took a look at the demo.   This is really great stuff!   I will try this 
on our cluster as soon as possible.   I like this because it allows people not 
too familiar with the cassandra CLI or Thrift a way to query cassandra data.



On Jun 20, 2011, at 10:56 AM, Markus Wiesenbacher | Codefreun.de wrote:

 Should work now ...
 
 Von meinem iPhone gesendet
 
 Am 20.06.2011 um 09:28 schrieb Andrey V. Panov panov.a...@gmail.com:
 
 How to download it?
 Your Download war-file open just blank page :(
 
 On 14/06/2011, Markus Wiesenbacher | Codefreun.de m...@codefreun.de wrote:
 
 I just released an early version of my web client
 (http://www.codefreun.de/apollo) which is Thrift-based, and therefore I
 would like to know what the future is ...



Re: jsvc hangs shell

2011-06-17 Thread Jonathan Colby
jsvc is not very flexible.  Check out the Java Service Wrapper; we swear by it.

http://wrapper.tanukisoftware.com/doc/english/download.jsp


On Jun 17, 2011, at 2:52 AM, Ken Brumer wrote:

 
 Anton Belyaev anton.belyaev at gmail.com writes:
 
 
 I guess it is not trivial to modify the package to make it use JSW
 instead of JSVC.
 I am still not sure the JSVC itself is a culprit. Maybe something is
 wrong in my setup.
 
 
 
 
 
 I am seeing similar behavior using the Brisk Debian packages for Maverick:
 
 http://www.datastax.com/docs/0.8/brisk/install_brisk_packages#installing-the-brisk-packaged-releases
 
 Not sure if it's my configuration, but I verified in on two seperate installs.
 
 -Ken 
 
 
 
 
 
 



Re: Re: minor vs major compaction and purging data

2011-06-13 Thread jonathan . colby
Cleanup removes any data that node is no longer responsible for, according  
to the node's token range. A node can have data it is no longer responsible  
for if you do certain maintenance operations like move or loadbalance.
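
e.g. after a move or loadbalance finishes, something like this on the node that
gave up part of its range (host is whatever your JMX setup points at):

    nodetool -h <hostname> cleanup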


On , Sebastien Coutu sco...@openplaces.org wrote:
How about cleanups? What would be the difference between cleanup and  
compactions?

On Sat, Jun 11, 2011 at 8:14 AM, Jonathan Ellis jbel...@gmail.com wrote:

Yes.

On Sat, Jun 11, 2011 at 6:08 AM, Jonathan Colby
jonathan.co...@gmail.com wrote:

  I've been reading inconsistent descriptions of what major and minor  
 compactions do. So my question for clarification:

  Are tombstones purged (ie, space reclaimed) for minor AND major  
 compactions?

  Thanks.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com







minor vs major compaction and purging data

2011-06-11 Thread Jonathan Colby
I've been reading inconsistent descriptions of what major and minor compactions 
do. So my question for clarification:

Are tombstones purged (ie, space reclaimed) for minor AND major compactions?

Thanks.

Compacting Large Row

2011-06-11 Thread Jonathan Colby
I'm seeing this in my logs.   We are storing emails in cassandra and some of 
them might be rather large.

Is this bad?  What exactly is happening when this appears?

 INFO [CompactionExecutor:1] 2011-06-11 13:39:19,217 CompactionIterator.java 
(line 150) Compacting large row 
39653235326331302d626530362d346339362d383966302d646338366366353237663565 
(67149805 bytes) incrementally
 INFO [CompactionExecutor:1] 2011-06-11 13:40:55,215 CompactionIterator.java 
(line 150) Compacting large row 
63343864303464622d336336332d343036322d386130392d343737373766343439643539 
(70605320 bytes) incrementally
 INFO [CompactionExecutor:1] 2011-06-11 13:43:27,353 CompactionIterator.java 
(line 150) Compacting large row 
39353463363062612d646364612d346137382d613838652d633130613439663664353532 
(72450230 bytes) incrementally
 INFO [CompactionExecutor:1] 2011-06-11 13:46:04,439 CompactionIterator.java 
(line 150) Compacting large row 
613634392d656135332d343565382d393265662d303336363731666365376439 
(72007535 bytes) incrementally
 INFO [CompactionExecutor:1] 2011-06-11 13:46:57,517 CompactionIterator.java 
(line 150) Compacting large row 
31636532356365332d323566632d343535382d623232312d363934636538333432323330 
(75976735 bytes) incrementally

Thanks Jon

after a while nothing happening with repair

2011-06-09 Thread Jonathan Colby
When I run repair on a node in my 0.7.6-2 cluster, the repair starts to stream 
data and activity is seen in the logs.

However, after a while (a day or so) it seems like everything freezes up.   The 
repair command is still running (the command prompt has not returned) and 
netstats shows output similar to below.  All streams at 0% and nothing 
happening.  The logs indicate that things were started but there is no 
indication if anything is in fact still active.

For example, this is the last log entry related to repair, just this morning:

 INFO [StreamStage:1] 2011-06-09 07:13:21,423 StreamOut.java (line 173) Stream 
context metadata [/var/lib/cassandra/data/DFS/main-f-144-Data.db sections=2 
progress=0/31947748 - 0%, /var/lib/cassandra/data/DFS/main-f-145-Data.db section
s=2 progress=0/25786564 - 0%, /var/lib/cassandra/data/DFS/main-f-143-Data.db 
sections=2 progress=0/5830103399 - 0%], 9 sstables.
 INFO [StreamStage:1] 2011-06-09 07:13:21,423 StreamOutSession.java (line 174) 
Streaming to /10.46.108.104


However, netstats on all related notes looks something like this.  The nodes 
continue to handle read/write requests just  fine. They are not overloaded at 
all.

Any advice would be greatly appreciated.  Because repairs seem like they never 
finish, I have a feeling we have a lot of garbage data in our cluster.

/opt/cassandra/bin/nodetool -h $HOSTNAME -p 35014 netstats 
Mode: Normal
Not sending any streams.
Streaming from: /10.46.108.104
   DFS: /var/lib/cassandra/data/DFS/main-f-209-Data.db sections=2 
progress=0/276461810 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-153-Data.db sections=2 
progress=0/100340568 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-40-Data.db sections=2 
progress=0/62726190502 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-180-Data.db sections=1 
progress=0/158898493 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-109-Data.db sections=2 
progress=0/87250515569 - 0%
Streaming from: /10.47.108.102
   DFS: /var/lib/cassandra/data/DFS/main-f-304-Data.db sections=2 
progress=0/13563864214 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-350-Data.db sections=1 
progress=0/2877129955 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-379-Data.db sections=2 
progress=0/143804948 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-370-Data.db sections=2 
progress=0/683716174 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-371-Data.db sections=2 
progress=0/56650 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-368-Data.db sections=2 
progress=0/4005533616 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-369-Data.db sections=2 
progress=0/155515922 - 0%
Streaming from: /10.46.108.103
   DFS: /var/lib/cassandra/data/DFS/main-f-888-Data.db sections=2 
progress=0/158096259 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-828-Data.db sections=1 
progress=0/29508276 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-886-Data.db sections=2 
progress=0/133704150 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-759-Data.db sections=2 
progress=0/83629797522 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-889-Data.db sections=2 
progress=0/96903803 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-751-Data.db sections=2 
progress=0/17944852950 - 0%
Streaming from: /10.46.108.101
   DFS: /var/lib/cassandra/data/DFS/main-f-1318-Data.db sections=2 
progress=0/60617216778 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-1179-Data.db sections=2 
progress=0/11870790009 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-1324-Data.db sections=2 
progress=0/710603722 - 0%
   DFS: /var/lib/cassandra/data/DFS/main-f-1322-Data.db sections=2 
progress=0/5844992187 - 0%



fixing unbalanced cluster !?

2011-06-09 Thread Jonathan Colby
I got myself into a situation where one node (10.47.108.100) has a lot more 
data than the other nodes.   In fact, the 1 TB disk on this node is almost 
full.  I added 3 new nodes and let cassandra automatically calculate new tokens 
by taking the highest loaded nodes.  Unfortunately there is still a big token 
range this  node is responsible for (5113... -  85070...).  Yes, I know that 
one option would be to rebalance the entire cluster with move but this is an 
extremely time-consuming and error-prone process because of the amount of data 
involved.  

Our RF = 3 and we read/write quorum.   The nodes have been repaired so I think 
the data should be in good shape.

Question:    Can I get myself out of this mess without installing new nodes?
I was thinking of either decommission or removetoken to have the cluster 
rebalance itself.  Then re-bootstrap this node to a new token.


Address         Status State   Load        Owns    Token
                                                    127605887595351923798765477786913079296
10.46.108.100   Up     Normal  218.52 GB   25.00%  0
10.46.108.101   Up     Normal  260.04 GB   12.50%  21267647932558653966460912964485513216
10.46.108.104   Up     Normal  286.79 GB   17.56%  51138582157040063602728874106478613120
10.47.108.100   Up     Normal  874.91 GB   19.94%  85070591730234615865843651857942052863
10.47.108.102   Up     Normal  302.79 GB   4.16%   92156241323118845370666296304459139297
10.47.108.103   Up     Normal  242.02 GB   4.16%   99241191538897700272878550821956884116
10.47.108.101   Up     Normal  439.9 GB    8.34%   113427455640312821154458202477256070484
10.46.108.103   Up     Normal  304 GB      8.33%   127605887595351923798765477786913079296

Re: fixing unbalanced cluster !?

2011-06-09 Thread Jonathan Colby
Thanks Ben.   That's what I was afraid I had to do.  I can see how it's a lot 
easier if you simply double the cluster when adding capacity.
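
For the record, the balanced tokens Ben lists below are just the usual
i * (2**127 / N) spacing for the RandomPartitioner; a rough way to generate
them (assuming Python 2 and an 8-node ring):

    python -c "num=8; print '\n'.join(str(i*(2**127/num)) for i in range(num))"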

Jon
 
On Jun 9, 2011, at 4:44 PM, Benjamin Coverston wrote:

 Because you were able to successfully run repair you can follow up with a 
 nodetool cleanup which will get rid of some of the extraneous data on that 
 (bigger) node. You're also assured after you run repair that entropy between 
 the nodes is minimal.
 
 Assuming you're using the random ordered partitioner: To balance your ring I 
 would start by calculating the new token locations, then moving each of your 
 nodes backwards along their owned range to their new locations.
 
 From the script on http://wiki.apache.org/cassandra/Operations your new 
 balanced tokens would be:
 
 0
 21267647932558653966460912964485513216
 42535295865117307932921825928971026432
 63802943797675961899382738893456539648
 85070591730234615865843651857942052864
 106338239662793269832304564822427566080
 127605887595351923798765477786913079296
 148873535527910577765226390751398592512
 
 From this you can see that  10.46.108.{100, 101} is already in the right 
 place so you don't have to do anything with those nodes. Proceed with moving 
 10.46.108.104 to its new token, the safest way to do this would be to use 
 nodetool move. Another way to do it could be to run a remove-token followed 
 by re-adding the node into the ring at its new location. The risk here is 
 that if you do not at least repair after re-joining the ring (and before you 
 move the next node in the ring) then some of the data on that node would be 
 ignored as it would now fall out of the owned range, so it's good practice to 
 immediately run repair on a node that you do a removetoken / re-join on.
 
 The rest of your balancing should be an iteration on the above steps moving 
 through the range.
 
 
 On 6/9/11 6:21 AM, Jonathan Colby wrote:
 I got myself into a situation where one node (10.47.108.100) has a lot more 
 data than the other nodes.   In fact, the 1 TB disk on this node is almost 
 full.  I added 3 new nodes and let cassandra automatically calculate new 
 tokens by taking the highest loaded nodes.  Unfortunately there is still a 
 big token range this  node is responsible for (5113... -  85070...).  Yes, I 
 know that one option would be to rebalance the entire cluster with move but 
 this is an extremely time-consuming and error-prone process because of the 
 amount of data involved.
 
 Our RF = 3 and we read/write quorum.   The nodes have been repaired so I 
 think the data should be in good shape.
 
 Question:Can I get myself out of this mess without installing new nodes? 
I was thinking of either decommission or removetoken to have the cluster 
 rebalance itself.  The re-bootstrap this node to a new token.
 
 
 Address Status State   LoadOwnsToken

 127605887595351923798765477786913079296
 10.46.108.100   Up Normal  218.52 GB   25.00%  0
 10.46.108.101   Up Normal  260.04 GB   12.50%  
 21267647932558653966460912964485513216
 10.46.108.104   Up Normal  286.79 GB   17.56%  
 51138582157040063602728874106478613120
 10.47.108.100   Up Normal  874.91 GB   19.94%  
 85070591730234615865843651857942052863
 10.47.108.102   Up Normal  302.79 GB   4.16%   
 92156241323118845370666296304459139297
 10.47.108.103   Up Normal  242.02 GB   4.16%   
 99241191538897700272878550821956884116
 10.47.108.101   Up Normal  439.9 GB8.34%   
 113427455640312821154458202477256070484
 10.46.108.103   Up Normal  304 GB  8.33%   
 127605887595351923798765477786913079296
 
 -- 
 Ben Coverston
 Director of Operations
 DataStax -- The Apache Cassandra Company
 http://www.datastax.com/
 



no additional log output after running repair

2011-05-31 Thread Jonathan Colby
I'm trying to run a repair on a 0.7.6-2 node.  After running the repair command, 
this line shows up in the cassandra.log, but nothing else.  It's been hours.
 Nothing is seen in the logs from other servers or with nodetool commands like 
netstats or tpstats.

How do  I know if the repair is actually going on or not?   This is incredibly 
frustrating.

 INFO [manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224] 2011-05-31 
14:05:25,625 AntiEntropyService.java (line 786) Waiting for repair requests: 
[#TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224
, /10.47.108.100, (DFS,main), #TreeRequest 
manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, /10.47.108.103, 
(DFS,main), #TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, 
/10.46.108.103, (DFS
,main), #TreeRequest manual-repair-9629edfc-7ae9-4626-b90a-2aa6eb1e8224, 
/10.46.108.101, (DFS,main)]


Jon

Re: exception when adding a node replication factor (3) exceeds number of endpoints (1) - SOLVED

2011-05-28 Thread Jonathan Colby
OK, it seems a phantom node (one that was removed from the cluster)
kept being passed around in gossip as a down endpoint and was messing
up the gossip algorithm.  I had the luxury of being able to stop the
entire cluster and bring the nodes up one by one.  That purged the bad
node from gossip.  Not sure if there was a more elegant way to do
that.

On Fri, May 27, 2011 at 9:28 AM,  jonathan.co...@gmail.com wrote:
 Anyone have any idea what this could mean?
 This is a cluster of 7 nodes, I'm trying to add the 8th node.

 INFO [FlushWriter:1] 2011-05-27 09:22:40,495 Memtable.java (line 164)
 Completed flushing /var/lib/cassandra/data/system/Migrations-f-1-Data.db
 (6358 bytes)
 INFO [FlushWriter:1] 2011-05-27 09:22:40,496 Memtable.java (line 157)
 Writing Memtable-Schema@60230368(2363 bytes, 3 operations)
 INFO [FlushWriter:1] 2011-05-27 09:22:40,562 Memtable.java (line 164)
 Completed flushing /var/lib/cassandra/data/system/Schema-f-1-Data.db (2513
 bytes)
 INFO [GossipStage:1] 2011-05-27 09:22:40,829 Gossiper.java (line 610) Node
 /10.46.108.104 is now part of the cluster
 ERROR [GossipStage:1] 2011-05-27 09:22:40,845
 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor
 java.lang.IllegalStateException: replication factor (3) exceeds number of
 endpoints (1)
 at
 org.apache.cassandra.locator.OldNetworkTopologyStrategy.calculateNaturalEndpoints(OldNetworkTopologyStrategy.java:100)
 at
 org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressRanges(AbstractReplicationStrategy.java:196)
 at
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:945)
 at
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:896)
 at
 org.apache.cassandra.service.StorageService.handleStateBootstrap(StorageService.java:707)
 at
 org.apache.cassandra.service.StorageService.onChange(StorageService.java:648)
 at
 org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1124)
 at
 org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:643)
 at org.apache.cassandra.gms.Gossiper.handleNewJoin(Gossiper.java:611)
 at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:690)
 at
 org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:60)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)
 ERROR [GossipStage:1] 2011-05-27 09:22:40,847 AbstractCassandraDaemon.java
 (line 112) Fatal exception in thread Thread[GossipStage:1,5,main]
 java.lang.IllegalStateException: replication factor (3) exceeds number of
 endpoints (1)
 at
 org.apache.cassandra.locator.OldNetworkTopologyStrategy.calculateNaturalEndpoints(OldNetworkTopologyStrategy.java:100)
 at
 org.apache.cassandra.locator.AbstractReplicationStrategy.getAddressRanges(AbstractReplicationStrategy.java:196)
 at
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:945)
 at
 org.apache.cassandra.service.StorageService.calculatePendingRanges(StorageService.java:896)
 at
 org.apache.cassandra.service.StorageService.handleStateBootstrap(StorageService.java:707)
 at
 org.apache.cassandra.service.StorageService.onChange(StorageService.java:648)
 at
 org.apache.cassandra.service.StorageService.onJoin(StorageService.java:1124)
 at
 org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:643)
 at org.apache.cassandra.gms.Gossiper.handleNewJoin(Gossiper.java:611)
 at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:690)
 at
 org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:60)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662)


new thing going on with repair in 0.7.6??

2011-05-28 Thread Jonathan Colby
It might just not have occurred to me in the previous 0.7.4 version,
but when I do a repair on a node in v0.7.6, it seems like data is also
synced with neighboring nodes.

My understanding of repair is that the data is reconciled on the node
being repaired, i.e., data is removed or added to that node based on
reading the data on other nodes.

I read another thread about a bug which results in the entire data
being streamed over when you don't specify a CF.  But in my case, we
only have one CF - we're using cassandra as a simple key/value store
so I don't think it applies to my setup.
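
For what it's worth, the per-CF form (going from memory on the 0.7 nodetool
syntax) would be something like the following, with DFS/main being our keyspace
and column family:

    nodetool -h localhost repair DFS main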

This is a netstats on the node being repaired. Note how everything is
streaming out to other nodes.  Is this a bug or an improvement?

Mode: Normal
Streaming to: /10.47.108.103
   /var/lib/cassandra/data/DFS/main-f-1833-Data.db sections=2542
progress=6243767484/48128279825 - 12%
   /var/lib/cassandra/data/DFS/main-f-1886-Data.db sections=2146
progress=0/748205318 - 0%
   /var/lib/cassandra/data/DFS/main-f-1854-Data.db sections=2542
progress=0/47640938847 - 0%
   /var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=2502
progress=0/1587416504 - 0%
   /var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=1409
progress=0/175226826 - 0%
   /var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=1108
progress=0/107442430 - 0%
   /var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=2542
progress=0/81697265819 - 0%
Streaming to: /10.46.108.103
   /var/lib/cassandra/data/DFS/main-f-1854-Data.db sections=72
progress=0/303912581 - 0%
   /var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=71
progress=0/24604460 - 0%
   /var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=26
progress=0/30900263 - 0%
   /var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=19
progress=0/150012 - 0%
   /var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=72
progress=0/436200262 - 0%
Streaming to: /10.46.108.101
   /var/lib/cassandra/data/DFS/main-f-1892-Data.db sections=193
progress=0/54332711 - 0%
   /var/lib/cassandra/data/DFS/main-f-1851-Data.db sections=693
progress=0/52937963 - 0%
   /var/lib/cassandra/data/DFS/main-f-1850-Data.db sections=135
progress=0/1323107 - 0%
   /var/lib/cassandra/data/DFS/main-f-1859-Data.db sections=702
progress=0/4220897850 - 0%
 Nothing streaming from /10.47.108.103


average repair/bootstrap durations

2011-05-27 Thread Jonathan Colby
Hi -

Operations  like repair and bootstrap on nodes in our cluster (average
load 150GB each) take a very long time.

By long I mean 1-2 days.   With nodetool netstats I can see the
progress % very slowly progressing.

I guess there are some throttling mechanisms built into cassandra.
And yes there is also production load on these nodes so it is somewhat
understandable. Also some of our compacted data files are as large as 50-60 GB
each.

I was just wondering if these times are similar to what other people
are experiencing or if there is a serious configuration problem with
our setup.

So what have you guys seen with operations like loadbalance,repair,
cleanup, bootstrap on nodes with large amounts of data??

I'm not seeing too many full garbage collections.  Other minor GCs are
well under a second.

Setup info:
0.7.4
5 GB heap
8 GB  ram
64 bit linux os
AMD quad core HP blades
CMS Garbage collector with default cassandra settings
1 TB raid 0 sata disks
across 2 datacenters, but operations within the same dc take very long too.


This is a netstat output of a bootstrap that has been going on for 3+ hours:

Mode: Normal
Streaming to: /10.47.108.103
   
/var/lib/cassandra/data/DFS/main-f-1541-Data.db/(0,32842490722),(32842490722,139556639427),(139556639427,161075890783)
 progress=94624588642/161075890783 - 58%
   /var/lib/cassandra/data/DFS/main-f-1455-Data.db/(0,660743002)
 progress=0/660743002 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1444-Data.db/(0,32816130132),(32816130132,71465138397),(71465138397,90968640033)
 progress=0/90968640033 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1540-Data.db/(0,931632934),(931632934,2621052149),(2621052149,3236107041)
 progress=0/3236107041 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1488-Data.db/(0,33428780851),(33428780851,110546591227),(110546591227,110851587206)
 progress=0/110851587206 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1542-Data.db/(0,24091168),(24091168,97485080),(97485080,108233211)
 progress=0/108233211 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1544-Data.db/(0,3646406),(3646406,18065308),(18065308,25776551)
 progress=0/25776551 - 0%
   /var/lib/cassandra/data/DFS/main-f-1452-Data.db/(0,676616940)
 progress=0/676616940 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1548-Data.db/(0,6957269),(6957269,48966550),(48966550,51499779)
 progress=0/51499779 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1552-Data.db/(0,237153399),(237153399,750466875),(750466875,898056853)
 progress=0/898056853 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1554-Data.db/(0,45155582),(45155582,195640768),(195640768,247592141)
 progress=0/247592141 - 0%
   /var/lib/cassandra/data/DFS/main-f-1449-Data.db/(0,2812483216)
 progress=0/2812483216 - 0%
   
/var/lib/cassandra/data/DFS/main-f-1545-Data.db/(0,107648943),(107648943,434575065),(434575065,436667186)
 progress=0/436667186 - 0%
Not receiving any streams.
Pool NameActive   Pending  Completed
Commandsn/a 0 134283
Responses   n/a 0 192438


Re: average repair/bootstrap durations

2011-05-27 Thread Jonathan Colby
Thanks Ed!   I was thinking about surrendering more memory to mmap
operations.  I'm going to try bringing the Xmx down to 4G
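
In case it helps anyone else, the knobs I plan to touch are in
conf/cassandra-env.sh (the values are just what I'm trying, not a
recommendation):

    # cap the heap so more RAM is left for the OS page cache / mmap'd sstables
    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="400M"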

On Fri, May 27, 2011 at 5:19 PM, Edward Capriolo edlinuxg...@gmail.com wrote:


 On Fri, May 27, 2011 at 9:08 AM, Jonathan Colby jonathan.co...@gmail.com
 wrote:

 Hi -

 Operations  like repair and bootstrap on nodes in our cluster (average
 load 150GB each) take a very long time.

 By long I mean 1-2 days.   With nodetool netstats I can see the
 progress % very slowly progressing.

 I guess there are some throttling mechanisms built into cassandra.
 And yes there is also production load on these nodes so it is somewhat
 understandable. Also some of out compacted data files are as 50-60 GB
 each.

 I was just wondering if these times are similar to what other people
 are experiencing or if there is a serious configuration problem with
 our setup.

 So what have you guys seen with operations like loadbalance,repair,
 cleanup, bootstrap on nodes with large amounts of data??

 I'm not seeing too many full garbage collections.  Other minor GCs are
 well under a second.

 Setup info:
 0.7.4
 5 GB heap
 8 GB  ram
 64 bit linux os
 AMD quad core HP blades
 CMS Garbage collector with default cassandra settings
 1 TB raid 0 sata disks
 across 2 datacenters, but operations within the same dc take very long
 too.


 This is a netstat output of a bootstrap that has been going on for 3+
 hours:

 Mode: Normal
 Streaming to: /10.47.108.103

 /var/lib/cassandra/data/DFS/main-f-1541-Data.db/(0,32842490722),(32842490722,139556639427),(139556639427,161075890783)
         progress=94624588642/161075890783 - 58%
   /var/lib/cassandra/data/DFS/main-f-1455-Data.db/(0,660743002)
         progress=0/660743002 - 0%

 /var/lib/cassandra/data/DFS/main-f-1444-Data.db/(0,32816130132),(32816130132,71465138397),(71465138397,90968640033)
         progress=0/90968640033 - 0%

 /var/lib/cassandra/data/DFS/main-f-1540-Data.db/(0,931632934),(931632934,2621052149),(2621052149,3236107041)
         progress=0/3236107041 - 0%

 /var/lib/cassandra/data/DFS/main-f-1488-Data.db/(0,33428780851),(33428780851,110546591227),(110546591227,110851587206)
         progress=0/110851587206 - 0%

 /var/lib/cassandra/data/DFS/main-f-1542-Data.db/(0,24091168),(24091168,97485080),(97485080,108233211)
         progress=0/108233211 - 0%

 /var/lib/cassandra/data/DFS/main-f-1544-Data.db/(0,3646406),(3646406,18065308),(18065308,25776551)
         progress=0/25776551 - 0%
   /var/lib/cassandra/data/DFS/main-f-1452-Data.db/(0,676616940)
         progress=0/676616940 - 0%

 /var/lib/cassandra/data/DFS/main-f-1548-Data.db/(0,6957269),(6957269,48966550),(48966550,51499779)
         progress=0/51499779 - 0%

 /var/lib/cassandra/data/DFS/main-f-1552-Data.db/(0,237153399),(237153399,750466875),(750466875,898056853)
         progress=0/898056853 - 0%

 /var/lib/cassandra/data/DFS/main-f-1554-Data.db/(0,45155582),(45155582,195640768),(195640768,247592141)
         progress=0/247592141 - 0%
   /var/lib/cassandra/data/DFS/main-f-1449-Data.db/(0,2812483216)
         progress=0/2812483216 - 0%

 /var/lib/cassandra/data/DFS/main-f-1545-Data.db/(0,107648943),(107648943,434575065),(434575065,436667186)
         progress=0/436667186 - 0%
 Not receiving any streams.
 Pool Name                    Active   Pending      Completed
 Commands                        n/a         0         134283
 Responses                       n/a         0         192438

 That is a little long but every case is different. With low request load
 and some heavy server iron (RAID, RAM) you can see a compaction move really
 fast, 300 GB in 4-6 hours. With enough load one of these operations
 (compact, cleanup, join) can get really bogged down to the point where it
 almost does not move. Sometimes that is just the way it is based on how
 fragmented your rows are and how fast your gear is. Not pushing your
 Cassandra caches up to your JVM limit can help. If your heap is often near
 full you can have jvm memory fragmentation which slows things down.

 0.8 has some more tuning options for compaction, multi-threaded, knobs for
 effective rate.

 I notice you are using:
 5 GB heap
 8 GB  ram

 So your RAM/DATA ratio is on the lower side. I think unless you have a good
 use case for row cache less Xmx is more, but that is a minor tweak.



Re: Re: nodetool move trying to stream data to node no longer in cluster

2011-05-27 Thread Jonathan Colby
Glad to report I fixed this problem.
1. I added the load_ring_state=false flag
2. I was able to arrange a time where I could take down the whole
cluster and bring it back up.

After that the phantom node disappeared.

On Fri, May 27, 2011 at 12:48 AM,  jonathan.co...@gmail.com wrote:
 Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion on
 removing it as an endpoint with jmx.

 On , aaron morton aa...@thelastpickle.com wrote:
 Off the top of my head the simple way to stop invalid end point state being
 passed around is a full cluster stop. Obviously that's not an option. The
 problem is if one node has the IP it will share it around with the others.

 Out of interest take a look at the o.a.c.db.FailureDetector MBean
 getAllEndpointStates() function. That returns the end point state held by
 the Gossiper. I think you should see the phantom IP listed in there.

 If it's only on some nodes *perhaps* restarting the node with the JVM
 option -Dcassandra.load_ring_state=false *may* help. That will stop the node
 from loading its saved ring state and force it to get it via gossip. Again,
 if there are other nodes with the phantom IP it may just get it again.

 I'll do some digging and try to get back to you. This pops up from time to
 time and thinking out loud I wonder if it would be possible to add a new
 application state that purges an IP from the ring. e.g.
 VersionedValue.STATUS_PURGED that works with a ttl so it goes through X
 number of gossip rounds and then disappears.

 Hope that helps.

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 26 May 2011, at 19:58, Jonathan Colby wrote:

  @Aaron -

  Unfortunately I'm still seeing messages like:   is down, removing from
  gossip, although not with the same frequency.

  And repair/move jobs don't seem to try to stream data to the removed
  node anymore.

  Anyone know how to totally purge any stored gossip/endpoint data on
  nodes that were removed from the cluster?  Or what might be happening here
  otherwise?

  On May 26, 2011, at 9:10 AM, aaron morton wrote:

  cool. I was going to suggest that but as you already had the move
  running I thought it may be a little drastic.

  Did it show any progress ? If the IP address is not responding there
  should have been some sort of error.

  Cheers

  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com

  On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote:

  Seems like it had something to do with stale endpoint information. I
  did a rolling restart of the whole cluster and that seemed to trigger the
  nodes to remove the node that was decommissioned.

  On , aaron morton aa...@thelastpickle.com wrote:
  Is it showing progress ? It may just be a problem with the
  information printed out.

  Can you check from the other nodes in the cluster to see if they are
  receiving the stream ?

  cheers

  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com

  On 26 May 2011, at 00:42, Jonathan Colby wrote:

  I recently removed a node (with decommission) from our cluster.

  I added a couple new nodes and am now trying to rebalance the
  cluster using nodetool move.

  However, netstats shows that the node being moved is trying to
  stream data to the node that I already decommissioned yesterday.

  The removed node was powered-off, taken out of dns, its IP is not
  even pingable.   It was never a seed either.

  This is cassandra 0.7.5 on 64bit linux.   How do I tell the cluster
  that this node is gone?  Gossip should have detected this.  The ring
  command shows the correct cluster IPs.

  Here is a portion of netstats. 10.46.108.102 is the node which was
  removed.

  Mode: Leaving: streaming data to other nodes
  Streaming to: /10.46.108.102
  /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
  ...
  5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266

Re: nodetool move trying to stream data to node no longer in cluster

2011-05-26 Thread Jonathan Colby
@Aaron -

Unfortunately I'm still seeing messages like: ip-of-removed-node is down, 
removing from gossip, although not with the same frequency.  

And repair/move jobs don't seem to try to stream data to the removed node 
anymore.

Anyone know how to totally purge any stored gossip/endpoint data for nodes that 
were removed from the cluster? Or what might be happening here otherwise?


On May 26, 2011, at 9:10 AM, aaron morton wrote:

 cool. I was going to suggest that but as you already had the move running I 
 thought it may be a little drastic. 
 
 Did it show any progress ? If the IP address is not responding there should 
 have been some sort of error. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote:
 
 Seems like it had something to do with stale endpoint information. I did a 
 rolling restart of the whole cluster and that seemed to trigger the nodes to 
 remove the node that was decommissioned.
 
 On , aaron morton aa...@thelastpickle.com wrote:
 Is it showing progress ? It may just be a problem with the information 
 printed out.
 
 
 
 Can you check from the other nodes in the cluster to see if they are 
 receiving the stream ?
 
 
 
 cheers
 
 
 
 -
 
 Aaron Morton
 
 Freelance Cassandra Developer
 
 @aaronmorton
 
 http://www.thelastpickle.com
 
 
 
 On 26 May 2011, at 00:42, Jonathan Colby wrote:
 
 
 
 I recently removed a node (with decommission) from our cluster.
 
 
 
 I added a couple new nodes and am now trying to rebalance the cluster 
 using nodetool move.
 
 
 
 However,  netstats shows that the node being moved is trying to stream 
 data to the node that I already decommissioned yesterday.
 
 
 
 The removed node was powered-off, taken out of dns, its IP is not even 
 pingable.   It was never a seed neither.
 
 
 
 This is cassandra 0.7.5 on 64bit linux.   How do I tell the cluster that 
 this node is gone?  Gossip should have detected this.  The ring commands 
 shows the correct cluster IPs.
 
 
 
 Here is a portion of netstats. 10.46.108.102 is the node which was removed.
 
 
 
 Mode: Leaving: streaming data to other nodes
 
 Streaming to: /10.46.108.102
 
  
 /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
 
 ...
 
 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)
 
progress=280574376402/12434049900 - 2256%
 
 .
 
 
 
 
 
 Note 10.46.108.102 is NOT part of the ring.
 
 
 
 Address Status State   LoadOwnsToken
 
  
 148873535527910577765226390751398592512
 
 10.46.108.100   Up Normal  71.73 GB12.50%  0
 
 10.46.108.101   Up Normal  109.69 GB   12.50%  
 21267647932558653966460912964485513216
 
 10.47.108.100   Up Leaving 281.95 GB   37.50%  
 85070591730234615865843651857942052863   
 10.47.108.102   Up Normal  210.77 GB   0.00%   
 85070591730234615865843651857942052864
 
 10.47.108.101   Up Normal  289.59 GB   16.67%  
 113427455640312821154458202477256070484
 
 10.46.108.103   Up Normal  299.87 GB   8.33%   
 127605887595351923798765477786913079296
 
 10.47.108.103   Up Normal  94.99 GB12.50%  
 148873535527910577765226390751398592511
 
 10.46.108.104   Up Normal  103.01 GB   0.00%   
 148873535527910577765226390751398592512
 
 
 
 
 
 
 
 
 
 



nodetool move trying to stream data to node no longer in cluster

2011-05-25 Thread Jonathan Colby
I recently removed a node (with decommission) from our cluster.

I added a couple new nodes and am now trying to rebalance the cluster using 
nodetool move.

However,  netstats shows that the node being moved is trying to stream data 
to the node that I already decommissioned yesterday.

The removed node was powered off, taken out of DNS, and its IP is not even 
pingable. It was never a seed either.

This is cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster that this 
node is gone? Gossip should have detected this. The ring command shows the 
correct cluster IPs.

Here is a portion of netstats. 10.46.108.102 is the node which was removed.

Mode: Leaving: streaming data to other nodes
Streaming to: /10.46.108.102
   
/var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
 ...
5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)
 progress=280574376402/12434049900 - 2256%
.


Note 10.46.108.102 is NOT part of the ring.

Address Status State   LoadOwnsToken
   
   
148873535527910577765226390751398592512 
10.46.108.100   Up Normal  71.73 GB12.50%  0
   
10.46.108.101   Up Normal  109.69 GB   12.50%  
21267647932558653966460912964485513216  
10.47.108.100   Up Leaving 281.95 GB   37.50%  
85070591730234615865843651857942052863  - currently being moved
10.47.108.102   Up Normal  210.77 GB   0.00%   
85070591730234615865843651857942052864  
10.47.108.101   Up Normal  289.59 GB   16.67%  
113427455640312821154458202477256070484 
10.46.108.103   Up Normal  299.87 GB   8.33%   
127605887595351923798765477786913079296 
10.47.108.103   Up Normal  94.99 GB12.50%  
148873535527910577765226390751398592511 
10.46.108.104   Up Normal  103.01 GB   0.00%   
148873535527910577765226390751398592512  





Re: Database grows 10X bigger after running nodetool repair

2011-05-25 Thread jonathan . colby
I'm not sure if this is the absolute best advice, but perhaps  
running nodetool cleanup will help clean up any data that isn't assigned  
to this node's token - in case you've moved the cluster around before.


Any exceptions in the logs, e.g. EOF? I experienced this and it caused the  
repairs to trip up every time. It was fixed with a scrub, which rebuilds  
all the sstables.
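
Both are plain nodetool commands if you want to try them (just a sketch; use your
own host and JMX port):

nodetool -h <host> -p <jmx-port> cleanup   # drop data outside the node's token range
nodetool -h <host> -p <jmx-port> scrub     # rewrite the sstables, skipping rows it cannot read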


I also turned swap off on my nodes; it is unnecessary overhead since  
mmap manages the virtual memory pretty well.


Be careful about running major compactions. You'll keep fusing all the  
data into bigger and bigger files, which in my experience are harder to  
perform maintenance tasks on.


Jon

On , Dominic Williams thedwilli...@gmail.com wrote:

Hi,


I've got a strange problem, where the database on a node has inflated 10X  
after running repair. This is not the result of receiving missed data.




I didn't perform repair within my usual 10 day cycle, so followed  
recommended practice:

http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds





The sequence of events was like this:




1) set GCGraceSeconds to some huge value
2) perform rolling upgrade from 0.7.4 to 0.7.6-2
3) run nodetool repair on the first node in cluster ~10pm. It has a ~30G  
database

4) 2.30am decide to leave it running all night and wake up 9am to find  
still running
5) late morning investigation shows that db size has increased to 370G.  
The snapshot folder accounts for only 30G

6) node starts to run out of disk space http://pastebin.com/Sm0B7nfR
7) decide to bail! Reset GCGraceSeconds to 864000 and restart node to  
stop repair

8) as node restarts it deletes a bunch of tmp files, reducing db size  
from 370G to 270G
9) node now constantly performing minor compactions and du rising  
slightly then falling by a greater amount after minor compaction deletes  
sstable

10) gradually disk usage is coming down. Currently at 254G (3pm)
11) performance of node obviously not great!



Investigation of the database reveals the main problem to have occurred  
in a single column family, UserFights. This contains millions of fight  
records from our MMO, but actually exactly the same number as the  
MonsterFights cf. However, the comparative size is





Column Family: MonsterFights
SSTable count: 38
Space used (live): 13867454647



Space used (total): 13867454647 (13G)
Memtable Columns Count: 516
Memtable Data Size: 598770



Memtable Switch Count: 4
Read Count: 514
Read Latency: 157.649 ms.



Write Count: 4059
Write Latency: 0.025 ms.
Pending Tasks: 0



Key cache capacity: 20
Key cache size: 183004
Key cache hit rate: 0.0023566218452145135



Row cache: disabled
Compacted row minimum size: 771
Compacted row maximum size: 943127



Compacted row mean size: 3208





Column Family: UserFights
SSTable count: 549



Space used (live): 185355019679
Space used (total): 219489031691 (219G)
Memtable Columns Count: 483



Memtable Data Size: 560569
Memtable Switch Count: 8
Read Count: 2159



Read Latency: 2589.150 ms.
Write Count: 4080
Write Latency: 0.018 ms.



Pending Tasks: 0
Key cache capacity: 20
Key cache size: 20



Key cache hit rate: 0.03357770764288416
Row cache: disabled
Compacted row minimum size: 925



Compacted row maximum size: 12108970
Compacted row mean size: 503069




These stats were taken at 3pm, and at 1pm UserFights was using 224G  
total, so overall size is gradually coming down.




Another observation is the following appearing in the logs during the  
minor compactions:

Compacting large row 536c69636b5061756c (121235810 bytes) incrementally



The largest number of fights any user has performed on our MMO that I can  
find is short of 10,000. Each fight record is smaller than 1K... so it  
looks like these rows have grown +10X somehow.




The size of UserFights on another replica node, which actually has a  
slightly higher proportion of ring is




Column Family: UserFights



SSTable count: 14
Space used (live): 17844982744
Space used (total): 17936528583 (18G)



Memtable Columns Count: 767
Memtable Data Size: 891153
Memtable Switch Count: 6



Read Count: 2298
Read Latency: 61.020 ms.
Write Count: 4261



Write Latency: 0.104 ms.
Pending Tasks: 0
Key cache capacity: 20



Key cache size: 55172
Key cache hit rate: 0.8079570484581498
Row cache: disabled



Compacted row minimum size: 925
Compacted row maximum size: 12108970
Compacted row mean size: 846477





...




All ideas and suggestions greatly appreciated as always!




Dominic
ria101.wordpress.com





Re: Re: nodetool move trying to stream data to node no longer in cluster

2011-05-25 Thread jonathan . colby
Seems like it had something to do with stale endpoint information. I did a  
rolling restart of the whole cluster and that seemed to trigger the nodes  
to remove the node that was decommissioned.


On , aaron morton aa...@thelastpickle.com wrote:
Is it showing progress ? It may just be a problem with the information  
printed out.




Can you check from the other nodes in the cluster to see if they are  
receiving the stream ?





cheers





-



Aaron Morton



Freelance Cassandra Developer



@aaronmorton



http://www.thelastpickle.com





On 26 May 2011, at 00:42, Jonathan Colby wrote:





 I recently removed a node (with decommission) from our cluster.






 I added a couple new nodes and am now trying to rebalance the cluster  
using nodetool move.






 However, netstats shows that the node being moved is trying to stream  
data to the node that I already decommissioned yesterday.






 The removed node was powered-off, taken out of dns, its IP is not even  
pingable. It was never a seed neither.






 This is cassandra 0.7.5 on 64bit linux. How do I tell the cluster that  
this node is gone? Gossip should have detected this. The ring commands  
shows the correct cluster IPs.






 Here is a portion of netstats. 10.46.108.102 is the node which was  
removed.







 Mode: Leaving: streaming data to other nodes



 Streaming to: /10.46.108.102


  
/var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97



 ...


  
5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)



 progress=280574376402/12434049900 - 2256%



 .











 Note 10.46.108.102 is NOT part of the ring.







 Address Status State Load Owns Token



 148873535527910577765226390751398592512



 10.46.108.100 Up Normal 71.73 GB 12.50% 0


 10.46.108.101 Up Normal 109.69 GB 12.50%  
21267647932558653966460912964485513216


 10.47.108.100 Up Leaving 281.95 GB 37.50%  
85070591730234615865843651857942052863
 10.47.108.102 Up Normal 210.77 GB 0.00%  
85070591730234615865843651857942052864


 10.47.108.101 Up Normal 289.59 GB 16.67%  
113427455640312821154458202477256070484


 10.46.108.103 Up Normal 299.87 GB 8.33%  
127605887595351923798765477786913079296


 10.47.108.103 Up Normal 94.99 GB 12.50%  
148873535527910577765226390751398592511


 10.46.108.104 Up Normal 103.01 GB 0.00%  
148873535527910577765226390751398592512


















extremely high temporary disk utilization 0.7.5

2011-05-21 Thread Jonathan Colby

On each of our nodes we have an average of 80 - 100 GB actual cassandra data on 
1 TB disks.There is normally plenty of capacity on the nodes.  Swap is OFF. 
 OS is Debian 64 bit.

Every once in a while,  the disk usage will skyrocket to  500+ GB, even once 
filling up the 1 TB disk (at least according to linux df).

The thing is, after restarting the cassandra daemon, the disk usage correctly 
reflects the actual data usage.

What could be causing this massive temporary disk allocation?

Is it malloc? Is this an indication that something is not configured 
correctly? Is this a bug?
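
Next time it happens I will try to capture the following before restarting the
daemon (a rough sketch):

df -h /var/lib/cassandra                    # what the OS thinks is used
du -sh /var/lib/cassandra/data/*            # what the live data directories actually hold
ls -lh /var/lib/cassandra/data/*/*tmp*      # leftover temporary sstables from compactions/streams
lsof +L1 | grep /var/lib/cassandra          # deleted-but-still-open files the JVM is holding on to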

Any help would be appreciated!

Jon





Re: jsvc hangs shell

2011-05-11 Thread jonathan . colby
We use the Java Service Wrapper from Tanuki Software and are very happy  
with it. It's a lot more robust than jsvc.


http://wrapper.tanukisoftware.com/doc/english/download.jsp

The free community version will be enough in most cases.

Jon

On May 11, 2011 10:30pm, Anton Belyaev anton.bely...@gmail.com wrote:

Hello,





I installed 0.7.5 to my Ubuntu 11.04 64 bit from package at



deb http://www.apache.org/dist/cassandra/debian 07x main





And I met really strange problem.



Any shell command that requires Cassandra's jsvc command line (for



example, ps -ef, or top with cmdline args) - just hangs.



Using STRACE I found out that commands hang during reading



/proc//cmdline.



I tried to cat the file - shell hung.





I tried both OpenJDK and Sun JDK - the bug remains.



I tried 0.6.13 on the same machine - works fine.



I tried 0.7.5 on another machine (with older Ubuntu) - works fine.





I believe this is not a Cassandra bug. But I am not sure where to ask



help with the problem.


Could you please advise what I should check to find out where the  
problem is?





Thanks.



Anton.




Re: What will be the steps for adding new nodes

2011-04-18 Thread Jonathan Colby
Your questions are pretty fundamental.  I recommend reading through the 
documentation to get a better understanding of how Cassandra works.

Here's good documentation from DataStax:

http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity

In a nutshell: you only bootstrap the new nodes, all nodes should have the same 
seed list, and the old nodes don't have to be restarted.
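
For example, on the new node only (a rough sketch of the relevant 0.7 cassandra.yaml
keys; adjust for 0.6's XML config if you stay on 0.6):

# conf/cassandra.yaml on the NEW node:
#   auto_bootstrap: true
#   initial_token: <calculated token>    # optional; otherwise it splits the busiest node's range
#   seeds:
#       - <existing seed ip>             # same seed list as the rest of the cluster
# then just start it; the existing nodes keep running untouched
bin/cassandra -p /var/run/cassandra.pid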


On Apr 16, 2011, at 7:48 AM, Roni wrote:

 I have a 0.6.4 Cassandra cluster of two nodes in full replica (replica factor 
 2). I wants to add two more nodes and balance the cluster (replica factor 2).
 I want all of them to be seed's.
  
 What should be the simple steps:
 1. add the <AutoBootstrap>true</AutoBootstrap> to all the nodes or only the 
 new ones?
 2. add the <Seed>[new_node]</Seed> to the config file of the old nodes 
 before adding the new ones?
 3. do the old node need to be restarted (if no change is needed in their 
 config file)?
  
 TX,
  
  



recurring EOFException exception in 0.7.4

2011-04-15 Thread Jonathan Colby
I've been struggling with these kinds of exceptions for some time now.  I 
thought it might have been a one-time thing, so on the 2 nodes where I saw this 
problem I pulled in fresh data with a repair on an empty data directory.

Unfortunately, this problem is now coming up on a new node that has, up until 
now, not had this problem.

What could be causing this?  Could it be related to encoding?   Why are these 
rows not readable?   

This exception prevents cassandra from doing repairs, and even minor 
compactions.  It also messes up memtable management (with a normal load of 
25GB,  disk goes to almost 100% full on a 500 GB hd).

This is incredibly frustrating.  This is the only pain-point I have had with 
cassandra so far.   By the way, this node was never upgraded - it was 0.7.4 
from the start, so that eliminates format compatibility problems.

ERROR [CompactionExecutor:1] 2011-04-15 21:31:23,479 PrecompactedRow.java (line 
82) Skipping row DecoratedKey(105452551814086725777389040553659117532, 
4d657373616765456e726963686d656e743a313032343937) in 
/var/lib/cassandra/data/DFS/main-f-91-Data.db
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:270)
at 
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:315)
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:272)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
at 
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
at 
org.apache.cassandra.io.PrecompactedRow.<init>(PrecompactedRow.java:78)
at 
org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:147)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at 
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at 
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:449)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:124)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:94)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



Re: Questions about the nodetool ring.

2011-04-12 Thread Jonathan Colby
This is normal when you just add single nodes.   When no token is assigned, 
the new node takes a portion of the ring from the most heavily loaded node.
As a consequence of this, the nodes will be out of balance.

In other words, if you doubled the number of nodes you would not have this 
problem.

The best way to rebalance the cluster is to generate new tokens and use the 
nodetool move new-token command to rebalance the nodes, one at a time.

After rebalancing you can run cleanup so the nodes get rid of data they no 
longer are responsible for.
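
For a 3 node cluster on the RandomPartitioner the evenly spaced tokens are 
i * 2**127 / 3, i.e. 0, 56713727820156410577229101238628035242 and 
113427455640312821154458202477256070485. Using the hosts from your ring output, 
the sequence would look roughly like this (one node at a time, waiting for each 
move to finish):

nodetool -h 192.168.1.25 -p 8090 move 0
nodetool -h 192.168.1.27 -p 8090 move 56713727820156410577229101238628035242
nodetool -h 192.168.1.28 -p 8090 move 113427455640312821154458202477256070485

# once all three moves are done, on each node:
nodetool -h <node> -p 8090 cleanup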

links:

http://wiki.apache.org/cassandra/Operations#Range_changes

http://wiki.apache.org/cassandra/Operations#Moving_or_Removing_nodes

http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity



On Apr 12, 2011, at 11:00 AM, Dikang Gu wrote:

 I have 3 cassandra 0.7.4 nodes in a cluster, and I get the ring stats:
 
 [root@yun-phy2 apache-cassandra-0.7.4]# bin/nodetool -h 192.168.1.28 -p 8090 
 ring
 Address Status State   LoadOwnsToken  
  

 109028275973926493413574716008500203721 
 192.168.1.25Up Normal  157.25 MB   69.92%  
 57856537434773737201679995572503935972  
 192.168.1.27Up Normal  201.71 MB   24.28%  
 99165710459060760249270263771474737125  
 192.168.1.28Up Normal  68.12 MB5.80%   
 109028275973926493413574716008500203721
 
 The load and owns vary on each node, is this normal?  And is there a way to 
 balance the three nodes?
 
 Thanks.
 
 -- 
 Dikang Gu
 
 0086 - 18611140205
 



Re: Questions about the nodetool ring.

2011-04-12 Thread Jonathan Colby
When you do a move, the node is decommissioned and bootstrapped. During the 
autobootstrap process the node will not receive reads until bootstrapping is 
complete.  I assume during the decommission phase the node will also be 
unavailable; someone correct me if I'm wrong.

the ring distribution looks better now.

The ? I get all the time too.   And if you run ring against different 
hosts, the question marks probably appear in different places.   I'm not sure 
if it means there is a problem.  I haven't taken those question marks too 
seriously.



On Apr 12, 2011, at 11:57 AM, Dikang Gu wrote:

 After the nodetool move, I got this:
 
 [root@server3 apache-cassandra-0.7.4]# bin/nodetool -h 10.18.101.213 ring
 Address Status State   LoadOwnsToken  
  

 113427455640312821154458202477256070485 
 10.18.101.211   ?  Normal  82.31 MB33.33%  0  
  
 10.18.101.212   ?  Normal  84.24 MB33.33%  
 56713727820156410577229101238628035242  
 10.18.101.213   Up Normal  54.44 MB33.33%  
 113427455640312821154458202477256070485
 
 Is this correct? Why is the status ? ?
 
 Thanks.
 
 On Tue, Apr 12, 2011 at 5:43 PM, Dikang Gu dikan...@gmail.com wrote:
 The 3 nodes were added to the cluster at the same time, so I'm not sure whey 
 the data vary.
 
 I calculate the tokens and get:
 node 0: 0
 node 1: 56713727820156410577229101238628035242
 node 2: 113427455640312821154458202477256070485
 
 So I should set these tokens to the three nodes?  
 
 And during the time I execute the nodetool move commands, can the cassandra 
 servers serve the front end requests at the same time? Is the data safe?
 
 Thanks.
 
 On Tue, Apr 12, 2011 at 5:15 PM, Jonathan Colby jonathan.co...@gmail.com 
 wrote:
 This is normal when you just add single nodes.   When no token is assigned, 
 the new node takes a portion of the ring from the most heavily loaded node.   
  As a consequence of this, the nodes will be out of balance.
 
 In other words, when you double the amount nodes you would not have this 
 problem.
 
 The best way to rebalance the cluster is to generate new tokens and use the 
 nodetool move new-token command to rebalance the nodes, one at a time.
 
 After rebalancing you can run cleanup so the nodes get rid of data they no 
 longer are responsible for.
 
 links:
 
 http://wiki.apache.org/cassandra/Operations#Range_changes
 
 http://wiki.apache.org/cassandra/Operations#Moving_or_Removing_nodes
 
 http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity
 
 
 
 On Apr 12, 2011, at 11:00 AM, Dikang Gu wrote:
 
  I have 3 cassandra 0.7.4 nodes in a cluster, and I get the ring stats:
 
  [root@yun-phy2 apache-cassandra-0.7.4]# bin/nodetool -h 192.168.1.28 -p 
  8090 ring
  Address Status State   LoadOwnsToken
 
  109028275973926493413574716008500203721
  192.168.1.25Up Normal  157.25 MB   69.92%  
  57856537434773737201679995572503935972
  192.168.1.27Up Normal  201.71 MB   24.28%  
  99165710459060760249270263771474737125
  192.168.1.28Up Normal  68.12 MB5.80%   
  109028275973926493413574716008500203721
 
  The load and owns vary on each node, is this normal?  And is there a way to 
  balance the three nodes?
 
  Thanks.
 
  --
  Dikang Gu
 
  0086 - 18611140205
 
 
 
 
 
 -- 
 Dikang Gu
 
 0086 - 18611140205
 
 
 
 
 -- 
 Dikang Gu
 
 0086 - 18611140205
 



repair never completes with finished successfully

2011-04-12 Thread Jonathan Colby
There are a few other threads related to problems with the nodetool repair in 
0.7.4.  However I'm not seeing any errors, just never getting a message that 
the repair completed successfully.

In my production and test clusters (with just a few MB of data) the nodetool 
repair prompt never returns, and the last entry in the cassandra.log is always 
something like:

#TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, 
/10.46.108.102, (DFS,main) completed successfully: 1 outstanding

But I don't see a message, even hours later, that the 1 outstanding request 
finished successfully.

Anyone else experience this?  These are physical server nodes in local data 
centers and not EC2 



Re: repair never completes with finished successfully

2011-04-12 Thread Jonathan Colby
There is no Repair session message either.   It just starts with a message 
like:

INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 
14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: 
[#TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, 
/10.46.108.101, (DFS,main), #TreeRequest 
manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, 
(DFS,main), #TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, 
/10.47.108.102, (DFS,main), #TreeRequest 
manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, (DFS,main)]

NETSTATS:

Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool NameActive   Pending  Completed
Commandsn/a 0 150846
Responses   n/a 0 443183

One node in our cluster still has unreadable rows, where the reads trip up 
every time for certain sstables (you've probably seen my earlier threads 
regarding that).   My suspicion is that the bloom filter read on the node with 
the corrupt sstables is never reporting back to the repair, thereby causing it 
to hang.


What would be great is a scrub tool that ignores unreadable/unserializable 
rows!  : )
 

On Apr 12, 2011, at 2:15 PM, aaron morton wrote:

 Do you see a message starting Repair session  and ending with completed 
 successfully ?
 
 Or do you see any streaming activity using nodetool netstats
 
 Repair can hang if a neighbour dies and fails to send a requested stream. It 
 will timeout after 24 hours (I think). 
 
 Aaron
 
 On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
 
 On 12/04/2011 13:31, Jonathan Colby wrote:
 There are a few other threads related to problems with the nodetool repair 
 in 0.7.4.  However I'm not seeing any errors, just never getting a message 
 that the repair completed successfully.
 
 In my production and test cluster (with just a few MB data)  the repair 
 nodetool prompt never returns and the last entry in the cassandra.log is 
 always something like:
 
 #TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, 
 /10.46.108.102, (DFS,main)  completed successfully: 1 outstanding
 
 But I don't see a message, even hours later, that the 1 outstanding request 
 finished successfully.
 
 Anyone else experience this?  These are physical server nodes in local data 
 centers and not EC2
 
 
 I've seen this.   To fix it  try a nodetool compact then repair.
 
 
 --
 Karl
 



quick repair tool question

2011-04-12 Thread Jonathan Colby
does a repair just compare the existing data from sstables on the node being 
repaired, or will it figure out which data this node should have and copy it 
in?

I'm trying to refresh all the data for a given node (without reassigning the 
token) starting with an emptied out data directory.

I tried nodetool move, but if I give the same token it previously was assigned 
it doesn't seem to trigger a decommission/bootstrap. 

Thanks.

Re: quick repair tool question

2011-04-12 Thread Jonathan Colby
I think I answered the question myself.  The data is streaming in from other 
replicas even though the node's data dir was emptied out (system dir was left 
alone).   

I'm not sure if this is the kosher way to rebuild the sstable data, but it 
seemed to work.   
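
Roughly what I did, in case it helps someone (a sketch; the system keyspace directory 
has to stay put so the node keeps its token and identity):

# stop cassandra on the node first
mv /var/lib/cassandra/data/DFS /var/lib/cassandra/data/DFS.old   # keep a copy; leave data/system alone
# start cassandra again, then pull this node's ranges back in:
nodetool -h $HOSTNAME -p 35014 repair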

/var/lib/cassandra/data # /opt/cassandra/bin/nodetool -h $HOSTNAME -p 35014 
netstats 
Mode: Normal
Not sending any streams.
Streaming from: /10.46.108.100
  DFS: 
/var/lib/cassandra/data/DFS/main-f-85-Data.db/(101772144,192460041),(192460041,267088244)
 progress=0/165316100 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-86-Data.db/(118410757,194489915),(194489915,247653739)
 progress=0/129242982 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-40-Data.db/(4823893695,4850323665),(4850323665,7818579650)
 progress=0/2994685955 - 0%
  DFS: /var/lib/cassandra/data/DFS/main-f-89-Data.db/(0,707948),(707948,2011040)
 progress=0/2011040 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-70-Data.db/(778069440,1015544852),(1015544852,1200443249)
 progress=0/422373809 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-71-Data.db/(119366025,132069485),(132069485,156787816)
 progress=0/37421791 - 0%
Streaming from: /10.47.108.100
  DFS: 
/var/lib/cassandra/data/DFS/main-f-365-Data.db/(0,24748050),(126473995,170409694)
 progress=0/68683749 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-367-Data.db/(0,935041),(935041,2238133)
 progress=0/2238133 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-366-Data.db/(0,4608808),(37713613,46884920)
 progress=0/13780115 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-242-Data.db/(0,1057203157),(3307900143,4339490352)
 progress=0/2088793366 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-352-Data.db/(0,19422069),(81246761,122537002)
 progress=0/60712310 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-225-Data.db/(0,1580865981),(4540941750,6024843721)
 progress=0/3064767952 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-349-Data.db/(0,21720053),(54115405,71716716)
 progress=0/39321364 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-364-Data.db/(0,72606213),(175419693,238159626)
 progress=0/135346146 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-363-Data.db/(0,1184983783),(3458591846,4556646617)
 progress=0/2283038554 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-368-Data.db/(0,756228),(756228,1626647)
 progress=0/1626647 - 0%
  DFS: /var/lib/cassandra/data/DFS/main-f-361-Data.db/(48074007,78009236)
 progress=0/29935229 - 0%
  DFS: 
/var/lib/cassandra/data/DFS/main-f-226-Data.db/(0,3111952321),(8592898278,11484622800)
 progress=0/6003676843 - 0%
Pool NameActive   Pending  Completed
Commandsn/a 0   5765
Responses   n/a 0   9811
On Apr 12, 2011, at 4:59 PM, Jonathan Colby wrote:

 does a repair just compare the existing data from sstables on the node being 
 repaired, or will it figure out which data this node should have and copy 
 it in?
 
 I'm trying to refresh all the data for a given node (without reassigning the 
 token) starting with an emptied out data directory.
 
 I tried nodetool move, but if I give the same token it previously was 
 assigned it doesn't seem to trigger a decommission/bootstrap. 
 
 Thanks.



Re: Cassandra 2 DC deployment

2011-04-12 Thread Jonathan Colby
When the down data center comes back up, the Quorum reads will result in a 
read-repair, so you will get valid data.   Besides that, hinted handoff will 
take care of getting data replicated to a previously down node.

 Your example is a little unrealistic because you could theoretically have a 
 DC with only one node.  So CL.ONE would work every time.   But if you have more 
than 1 node, you have to decide if your application can tolerate getting NULL 
 for a read if the write hasn't propagated from the responsible node to the 
replica.

disclaimer:  I'm a cassandra novice.

On Apr 12, 2011, at 5:12 PM, Raj N wrote:

 Hi experts,
  We are planning to deploy Cassandra in 2 datacenters. Let assume there 
 are 3 nodes, RF=3, 2 nodes in 1 DC and 1 node in 2nd DC. Under normal 
 operations, we would read and write at QUORUM. What we want to do though is 
 if we lose a datacenter which has 2 nodes, DC1 in this case, we want to 
 downgrade our consistency to ONE. Basically I am saying that whenever there 
 is a partition, then prefer availability over consistency. In order to do 
 this we plan to catch UnavailableException and take corrective action. So try 
 QUORUM under normal circumstances, if unavailable try ONE. My questions -
 Do you guys see any flaws with this approach? 
 What happens when DC1 comes back up and we start reading/writing at QUORUM 
 again? Will we read stale data in this case?
 
 Thanks
 -Raj



Re: Help on decommission

2011-04-12 Thread Jonathan Colby
How long has it been in Leaving status?   Is the cluster under stress-test load 
while you are doing the decommission?

On Apr 12, 2011, at 6:53 PM, Baskar Duraikannu wrote:

 I have setup a 4 node cluster for testing. When I setup the cluster, I have 
 setup initial tokens in such a way that each gets 25% of load and then 
 started the node with autobootstrap=false.
  
  
 After all nodes are up, I loaded data using the stress test tool with 
 replication factor of 3.  As per of my testing, I am trying to remove one of 
 the node using nodetool decomission but the node seems to be stuck in 
 leaving status.
  
 How do I check whether it is doing any work at all? Please help
  
  
 [root@localhost bin]# ./nodetool -h 10.140.22.25 ring
 Address Status State   LoadOwnsToken

 127605887595351923798765477786913079296
 10.140.22.66Up Leaving 119.41 MB   25.00%  0
 10.140.22.42Up Normal  116.23 MB   25.00%  
 42535295865117307932921825928971026432
 10.140.22.28Up Normal  119.93 MB   25.00%  
 85070591730234615865843651857942052864
 10.140.22.25Up Normal  116.21 MB   25.00%  
 127605887595351923798765477786913079296
 [root@localhost bin]# ./nodetool -h 10.140.22.66 netstats
 Mode: Leaving: streaming data to other nodes
 Streaming to: /10.140.22.42
/var/lib/cassandra/data/Keyspace1/Standard1-f-1-Data.db/(0,120929157)
  progress=120929157/120929157 - 100%
/var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db/(0,3361291)
  progress=0/3361291 - 0%
 Not receiving any streams.
 Pool NameActive   Pending  Completed
 Commandsn/a 0 17
 Responses   n/a 0 108109
 [root@usnynyc1cass02 bin]# ./nodetool -h 10.140.22.42 netstats
 Mode: Normal
 Not sending any streams.
 Streaming from: /10.140.22.66
Keyspace1: 
 /var/lib/cassandra/data/Keyspace1/Standard1-f-2-Data.db/(0,3361291)
  progress=0/3361291 - 0%
 Pool NameActive   Pending  Completed
 Commandsn/a 0 11
 Responses   n/a 0 107879
  
  
 Regards,
 Baskar



Re: flush_largest_memtables_at messages in 7.4

2011-04-12 Thread Jonathan Colby
Your JVM heap has reached 78% full, so Cassandra automatically flushes its memtables. 
You need to tell us more about your configuration: 32 or 64 bit OS, what is the 
max heap, how much RAM is installed?

If this happens under stress-test conditions it's probably understandable. You 
should look into graphing your memory usage, or use jconsole to graph the heap 
during your tests.
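
To watch it while the stress run is in progress, something like this works (a rough 
sketch; I believe nodetool info reports heap usage on 0.7, and jstat ships with the JDK, 
so substitute the Cassandra JVM pid):

nodetool -h dsdb4 info                     # shows current heap used / total
jstat -gcutil <cassandra-jvm-pid> 5000     # old-gen occupancy and GC counts every 5 seconds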

On Apr 12, 2011, at 8:36 PM, mcasandra wrote:

 I am using cassandra 7.4 and getting these messages.
 
 Heap is 0.7802529021498031 full. You may need to reduce memtable and/or
 cache sizes Cassandra will now flush up to the two largest memtables to free
 up memory. Adjust flush_largest_memtables_at threshold in cassandra.yaml if
 you don't want Cassandra to do this automatically
 
 How do I verify that I need to adjust any thresholds? And how to calculate
 correct value?
 
 When I got this message only reads were occuring.
 
 create keyspace StressKeyspace
with replication_factor = 3
and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
 
 use StressKeyspace;
 drop column family StressStandard;
 create column family StressStandard
with comparator = UTF8Type
and keys_cached = 100
and memtable_flush_after = 1440
and memtable_throughput = 128;
 
 nodetool -h dsdb4 tpstats
 Pool NameActive   Pending  Completed
 ReadStage32   281 456598
 RequestResponseStage  0 0 797237
 MutationStage 0 0 499205
 ReadRepairStage   0 0 149077
 GossipStage   0 0 217227
 AntiEntropyStage  0 0  0
 MigrationStage0 0201
 MemtablePostFlusher   0 0   1842
 StreamStage   0 0  0
 FlushWriter   0 0   1841
 FILEUTILS-DELETE-POOL 0 0   3670
 MiscStage 0 0  0
 FlushSorter   0 0  0
 InternalResponseStage 0 0  0
 HintedHandoff 0 0 15
 
 cfstats
 
 Keyspace: StressKeyspace
Read Count: 460988
Read Latency: 38.07654727454945 ms.
Write Count: 499205
Write Latency: 0.007409593253272703 ms.
Pending Tasks: 0
Column Family: StressStandard
SSTable count: 9
Space used (live): 247408645485
Space used (total): 247408645485
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 1878
Read Count: 460989
Read Latency: 28.237 ms.
Write Count: 499205
Write Latency: NaN ms.
Pending Tasks: 0
Key cache capacity: 100
Key cache size: 299862
Key cache hit rate: 0.6031833150384193
Row cache: disabled
Compacted row minimum size: 219343
Compacted row maximum size: 5839588
Compacted row mean size: 497474
 
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/flush-largest-memtables-at-messages-in-7-4-tp6266221p6266221.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: quick repair tool question

2011-04-12 Thread Jonathan Colby
cool!  and I thought I made that one up myself : )

On Apr 13, 2011, at 2:13 AM, Chris Burroughs wrote:

 On 04/12/2011 11:11 AM, Jonathan Colby wrote:
 I'm not sure if this is the kosher way to rebuild the sstable data, but it 
 seemed to work.  
 
 http://wiki.apache.org/cassandra/Operations#Handling_failure
 
 Option #3.
 



Re: repair never completes with finished successfully

2011-04-12 Thread Jonathan Colby
great tips.  I will investigate further with your suggestions in mind.  
Hopefully the problem has gone away since I  pulled in fresh data on the node 
with problems.

On Apr 13, 2011, at 3:54 AM, aaron morton wrote:

 Ah, unreadable rows and in the validation compaction no less. Makes a little 
 more sense now. 
 
 Anyone help with the EOF when deserializing columns ? Is the fix to run scrub 
 or drop the sstable ?
 
 Here's a theory, AES is trying to...
 
 1) Create TreeRequest 's that specify a range we want to validate. 
 2) Send TreeRequest 's to local node and neighbour
 3) Process TreeRequest by running a validation compaction 
 (CompactionManager.doValidationCompaction in your prev stacks)
 4) When both TreeRequests return back work out the differences and then 
 stream data if needed. 
 
 Perhaps step 3 is not completing because of errors like 
 http://www.mail-archive.com/user@cassandra.apache.org/msg12196.html If the 
 row is over multiple sstables we can skip the row in one sstable. However, if 
 it's in a single sstable, PrecompactedRow will raise an IOError if there is a 
 problem. This is not what is in the linked error stack, which shows a row being 
 skipped; just a hunch we could check out.
 
 Do you see an IOErrors (not exceptions) in the logs or exceptions with 
 doValidationCompaction in the stack?
 
 For a tree request on the node you start compaction on you should see these 
 logs...
 1) Waiting for repair requests...
 2) One of Stored local tree or Stored remote tree depending on which 
 returns first at DEBUG level
 3) Queuing comparison
 
 If we do not have the 3rd log then we did not get a reply from either the local 
 or the remote node. 
 
 Aaron
 
 On 13 Apr 2011, at 00:57, Jonathan Colby wrote:
 
 There is no Repair session message either.   It just starts with a message 
 like:
 
 INFO [manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723] 2011-04-10 
 14:00:59,051 AntiEntropyService.java (line 770) Waiting for repair requests: 
 [#TreeRequest manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, 
 /10.46.108.101, (DFS,main), #TreeRequest 
 manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.100, 
 (DFS,main), #TreeRequest 
 manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.102, 
 (DFS,main), #TreeRequest 
 manual-repair-2af33a51-f46a-4ba2-b1fb-ead5159dc723, /10.47.108.101, 
 (DFS,main)]
 
 NETSTATS:
 
 Mode: Normal
 Not sending any streams.
 Not receiving any streams.
 Pool NameActive   Pending  Completed
 Commandsn/a 0 150846
 Responses   n/a 0 443183
 
 One node in our cluster still has unreadable rows, where the reads trip up 
 every time for certain sstables (you've probably seen my earlier threads 
 regarding that).   My suspicion is that the bloom filter read on the node 
 with the corrupt sstables is never reporting back to the repair, thereby 
 causing it to hang.
 
 
 What would be great is a scrub tool that ignores unreadable/unserializable 
 rows!  : )
 
 
 On Apr 12, 2011, at 2:15 PM, aaron morton wrote:
 
 Do you see a message starting Repair session  and ending with completed 
 successfully ?
 
 Or do you see any streaming activity using nodetool netstats
 
 Repair can hang if a neighbour dies and fails to send a requested stream. 
 It will timeout after 24 hours (I think). 
 
 Aaron
 
 On 12 Apr 2011, at 23:39, Karl Hiramoto wrote:
 
 On 12/04/2011 13:31, Jonathan Colby wrote:
 There are a few other threads related to problems with the nodetool 
 repair in 0.7.4.  However I'm not seeing any errors, just never getting a 
 message that the repair completed successfully.
 
 In my production and test cluster (with just a few MB data)  the repair 
 nodetool prompt never returns and the last entry in the cassandra.log is 
 always something like:
 
 #TreeRequest manual-repair-f739ca7a-bef8-4683-b249-09105f6719d9, 
 /10.46.108.102, (DFS,main)  completed successfully: 1 outstanding
 
 But I don't see a message, even hours later, that the 1 outstanding 
 request finished successfully.
 
 Anyone else experience this?  These are physical server nodes in local 
 data centers and not EC2
 
 
 I've seen this.   To fix it  try a nodetool compact then repair.
 
 
 --
 Karl
 
 
 



Re: unrepairable sstable data rows

2011-04-11 Thread Jonathan Colby
Thanks for the answer Aaron. 

There are Data, Index, Filter, and Statistics files associated with SSTables.   
What files must be physically moved/deleted? 

I tried just moving the Data file and Cassandra would not start. I see this 
exception:

 WARN [WrapperSimpleAppMain] 2011-04-11 12:04:23,239 ColumnFamilyStore.java 
(line 493) Removing orphans for /var/lib/cassandra/data/DFS/main-f-5: [Data.db]
ERROR [WrapperSimpleAppMain] 2011-04-11 12:04:23,240 
AbstractCassandraDaemon.java (line 333) Exception encountered during startup.
java.lang.AssertionError: attempted to delete non-existing file main-f-5-Data.db
at 
org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:46)
at 
org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:41) 
   at 
org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:498)
at 
org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:153)
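
Presumably all of the components of that generation have to be moved together for the 
orphan check to pass; something like this (a sketch, using the main-f-232 generation 
from the original repair errors as an example):

# stop cassandra, then quarantine every component of the broken sstable:
cd /var/lib/cassandra/data/DFS
mkdir -p /var/lib/cassandra/quarantine
mv main-f-232-Data.db main-f-232-Index.db main-f-232-Filter.db \
   main-f-232-Statistics.db /var/lib/cassandra/quarantine/
# start cassandra again and run a repair to re-fetch the affected ranges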

On Apr 11, 2011, at 2:14 AM, aaron morton wrote:

 But if you wanted to get fresh data on the node, a simple approach is to 
 delete/move just the SSTable that is causing problems then run a repair. That 
 should reduce the amount of data that needs to be moved. 



exceptions during bootstrap 0.7.4

2011-04-11 Thread Jonathan Colby
Seeing these exceptions on a node during the bootstrap phase of a move. 
Cassandra 0.7.4.  Anyone able to shed more light on what may be causing this?

BTW - the move was done to assign a new token; the decommission phase seemed to 
have gone OK.  Bootstrapping is still in progress (I hope).

 INFO [CompactionExecutor:1] 2011-04-11 16:26:25,583 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-249
 INFO [CompactionExecutor:1] 2011-04-11 16:27:21,067 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-250
 INFO [CompactionExecutor:1] 2011-04-11 16:28:01,745 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-251
 INFO [CompactionExecutor:1] 2011-04-11 16:36:21,320 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-252
 INFO [CompactionExecutor:1] 2011-04-11 16:36:33,485 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-253
ERROR [CompactionExecutor:1] 2011-04-11 16:36:34,368 
AbstractCassandraDaemon.java (line 112) Fatal exception in thread 
Thread[CompactionExecutor:1,1,main]
java.io.EOFException
at 
org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65)
at 
org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [Thread-329] 2011-04-11 16:36:34,369 AbstractCassandraDaemon.java (line 
112) Fatal exception in thread Thread[Thread-329,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.io.EOFException
at 
org.apache.cassandra.streaming.StreamInSession.closeIfFinished(StreamInSession.java:151)
at 
org.apache.cassandra.streaming.IncomingStreamReader.read(IncomingStreamReader.java:63)
at 
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:91)
Caused by: java.util.concurrent.ExecutionException: java.io.EOFException
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.cassandra.streaming.StreamInSession.closeIfFinished(StreamInSession.java:135)
... 2 more
Caused by: java.io.EOFException
at 
org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65)
at 
org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
 INFO [CompactionExecutor:1] 2011-04-11 16:36:37,317 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-255
 INFO [CompactionExecutor:1] 2011-04-11 16:36:37,426 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-256
ERROR [CompactionExecutor:1] 2011-04-11 16:36:38,290 
AbstractCassandraDaemon.java (line 112) Fatal exception in thread 
Thread[CompactionExecutor:1,1,main]
java.io.EOFException
at 
org.apache.cassandra.io.sstable.IndexHelper.skipIndex(IndexHelper.java:65)
at 
org.apache.cassandra.io.sstable.SSTableWriter$Builder.build(SSTableWriter.java:315)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:942)
at 
org.apache.cassandra.db.CompactionManager$9.call(CompactionManager.java:935)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



help! seed node needs to be replaced

2011-04-11 Thread Jonathan Colby

My seed node (1 of 4), which has the wraparound range (token 0), needs to be 
replaced.


Should I bootstrap the node with a new IP, then add it back as a seed?

Should I run removetoken on another node to take over the range?

Re: help! seed node needs to be replaced

2011-04-11 Thread Jonathan Colby
I shut down Cassandra, deleted (with a backup) the contents of the data 
directory, and did a nodetool move 0.    It seems to be populating the node 
with its range of data.    Hope that was a good idea.

On Apr 11, 2011, at 10:38 PM, Jonathan Colby wrote:

 
 My seed node (1 of 4)  having the wraparound range (token 0) needs to be 
 replaced.
 
 
 Should I bootstrap the node with a new IP, then add it back as a seed?   
 
 Should I run remove token on another node to take over the range?



Re: help! seed node needs to be replaced

2011-04-11 Thread Jonathan Colby
Yes.  This node has repeatedly given problems while reading various sstables.  
So I decided to start with a fresh data dir, relying on the fact that with an 
RF=3, the data will be able to be retrieved from the cluster.

Since this is a seed node, I am a little unsure how to proceed.  From 
everything I've read, bootstrapping a seed is not a good idea.  One idea I had 
was to change the IP, bootstrap, and change the IP back.    But I just tried 
nodetool move 0, with the hope that it might work.


On Apr 11, 2011, at 11:31 PM, aaron morton wrote:

 Is this the node that had the earlier EOF error during bootstrap ? 
 
 Aaron
 
 On 12 Apr 2011, at 08:42, Jonathan Colby wrote:
 
 I shutdown cassandra, deleted (with a backup) the contents of the data 
 directory and did a nodetool move 0.It seems to be populating the node 
 with its range of data.Hope that was a good idea.
 
 On Apr 11, 2011, at 10:38 PM, Jonathan Colby wrote:
 
 
 My seed node (1 of 4)  having the wraparound range (token 0) needs to be 
 replaced.
 
 
 Should I bootstrap the node with a new IP, then add it back as a seed?   
 
 Should I run remove token on another node to take over the range?
 
 



unrepairable sstable data rows

2011-04-10 Thread Jonathan Colby
It appears we have several unserializable or unreadable rows.  These were not 
fixed even after doing a scrub  on all nodes -  even though the scrub seemed 
to have completed successfully.

I'm trying to fix these by doing a repair, but these exceptions are thrown 
exactly when doing a repair.   Anyone run into this issue?  What's the best way 
to fix this?  

I was thinking that flushing and reloading the data with a move (reusing the 
same token) might be a way to get out of this.


Exception seem multiple times for different keys during a repair:

ERROR [CompactionExecutor:1] 2011-04-10 14:05:55,528 PrecompactedRow.java (line 
82) Skipping row DecoratedKey(58054163627659284217684165071269705317, 
64396663313763662d383432622d343439652d623761312d643164663936333738306565) in 
/var/lib/cassandra/data/DFS/main-f-232-Data.db
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268)
at 
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310)
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
at 
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
at 
org.apache.cassandra.io.PrecompactedRow.<init>(PrecompactedRow.java:78)
at 
org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at 
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at 
org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803)
at 
org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56)
at 
org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)


This WARN also seems to come up often during a repair.  Not sure if it is related 
to this problem:

 WARN [ScheduledTasks:1] 2011-04-10 14:10:24,991 GCInspector.java (line 149) 
Heap is 0.8675910480028087 full.  You may need to reduce memtable and/or cache 
sizes.  Cassandra will now flush up to the two largest memtables to free up 
memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you 
don't want Cassandra to do this automatically
 WARN [ScheduledTasks:1] 2011-04-10 14:10:24,992 StorageService.java (line 
2206) Flushing ColumnFamilyStore(table='DFS', columnFamily='main') to relieve 
memory pressure
 INFO [ScheduledTasks:1] 2011-04-10 14:10:24,992 ColumnFamilyStore.java (line 
695) switching in a fresh Memtable for main at 
CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1302435708131.log',
 position=28257053)



Re: auto_bootstrap

2011-04-09 Thread Jonathan Colby
I can't explain the technical reason why it's not advisable to bootstrap a 
seed.   However, from what I've read you would bootstrap the node as a non-seed 
first, then add it as a seed once it has finished bootstrapping.

On Apr 8, 2011, at 9:30 PM, mcasandra wrote:

 in yaml:
 # Set to true to make new [non-seed] nodes automatically migrate data
 # to themselves from the pre-existing nodes in the cluster. 
 
 Why only non-seed nodes? What if seed nodes need to bootstrap?
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/auto-bootstrap-tp6254993p6254993.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



Re: nodetool move hammers the next node in the ring

2011-04-09 Thread Jonathan Colby
thanks!  I'll be watching this issue closely.

On Apr 9, 2011, at 5:41 AM, Chris Goffinet wrote:

 We also have a ticket open at 
 
 https://issues.apache.org/jira/browse/CASSANDRA-2399
 
 We have observed in production the impact of streaming data to new nodes 
 being added. We actually have our entire dataset in page cache in one of our 
 clusters, our 99th percentiles go from 20ms to 1 second on streaming nodes 
 when bootstrapping in new nodes because of blowing out the page cache during 
 the process. We are hoping to have this addressed soon. I think throttling of 
 streaming would be good too, to help minimize saturating the network card on 
 the streaming node. Dynamic snitch should help with this, we'll try to report 
 back our results very soon on what it looks like for that case.
  
 -Chris
 
 On Apr 8, 2011, at 7:35 PM, aaron morton wrote:
 
 My brain just started working. The streaming for the move may need to be 
 throttled, but once the file has been received the bloom filters, row 
 indexes and secondary indexes are built. That will also take some effort, do 
 you have any secondary indexes? 
 
 If you are doing a move again could you try turning up logging to DEBUG on 
 one of the neighbour nodes. Once the file has been received you will see a 
 message saying Finished {file_name}. Sending ack to {remote_ip}. After 
 this log message the rebuilds will start; it would be interesting to see which 
 is more heavyweight. I'm guessing the rebuilds.
 
 This is similar to https://issues.apache.org/jira/browse/CASSANDRA-2156 but 
 that ticket will not cover this case. I've added this use case to the 
 comments, please check there if you want to follow along.
 
 Cheers
 Aaron
 
 
 On 6 Apr 2011, at 16:26, Jonathan Colby wrote:
 
 thanks for the response Aaron.   Our cluster has 6 nodes with 10 GB load on 
 each.   RF=3.AMD 64 bit Blades, Quad Core, 8 GB ram,  running Debian 
 Linux.  Swap off.  Cassandra 0.7.4
 
 
 On Apr 6, 2011, at 2:40 AM, aaron morton wrote:
 
 Not that I know of, may be useful to be able to throttle things. But if 
 the receiving node has little head room it may still be overwhelmed.
 
 Currently there is a single thread for streaming. If we were to throttle 
 it may be best to make it multi threaded with a single concurrent stream 
 per end point. 
 
 Out of interest how many nodes do you have and whats the RF?
 
 Aaron
 
 
 On 6 Apr 2011, at 01:16, Jonathan Colby wrote:
 
 
 When doing a move, decommission, loadbalance, etc.  data is streamed to 
 the next node in such a way that it really strains the receiving node - 
 to the point where it has a problem serving requests.   
 
 Any way to throttle the streaming of data?
 
 
 
 



Is the repair still going on or did it fail because of exceptions?

2011-04-08 Thread Jonathan Colby
It seems on my cluster there are a few unserializable rows.  I'm trying to run 
a repair on the nodes, but it also seems that the replica nodes have unreadable 
or unserializable rows.  The problem is, I cannot determine if the repair is 
still going on, or if it was interrupted because of these errors.   It is unclear 
because nothing else related to the repair shows up in the logs.  It's been 
about 5 hours and I also don't see anything happening when I perform a 
nodetool netstats on the nodes.  The nodetool repair command is still 
blocking from the console.

On the node I'm trying to repair, I see this after launching a repair:

...
 INFO [manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc] 2011-04-08 
11:41:55,520 AntiEntropyService.java (line 770) Waiting for repair requests: 
[#TreeRequest manual-repair-6160b400-2c82-4ccb-9451-7
9caafd7d3cc, /10.46.108.102, (DFS,main), #TreeRequest 
manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc, /10.46.108.101, 
(DFS,main), #TreeRequest manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc
, /10.46.108.100, (DFS,main), #TreeRequest 
manual-repair-6160b400-2c82-4ccb-9451-79caafd7d3cc, /10.47.108.101, (DFS,main)]
...

In the log of the node 10.46.108.102 where the repair tries to compare the 
replica data,   I see a couple of the below exceptions a few minutes later.
Are the exceptions bad enough to cause the repair to fail?


ERROR [CompactionExecutor:1] 2011-04-08 11:43:01,177 PrecompactedRow.java (line 
82) Skipping row DecoratedKey(1782314446006375058060694305099335169, 
4d657373616765456e726963686d656e743a31343236) in /va
r/lib/cassandra/data/DFS/main-f-177-Data.db
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268)
at 
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310)
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:94)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
at 
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
at 
org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
at 
org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at 
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at 
org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803)
at 
org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56)
at 
org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [CompactionExecutor:1] 2011-04-08 11:43:53,762 PrecompactedRow.java (line 
82) Skipping row DecoratedKey(8073554114801607394928746621229606383, 
34393734663734382d316330302d346164372d61372d3162
3430386661393832) in /var/lib/cassandra/data/DFS/main-f-177-Data.db
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268)
at 
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310)
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267)
:

nodetool netstats reports:

Mode: Normal
Not sending any streams.
Not receiving any streams.
Pool Name            Active   Pending   Completed
Commands                n/a         0      526207

Re: consistency ONE and null

2011-04-07 Thread Jonathan Colby
that makes sense.  thanks!

On Apr 7, 2011, at 8:36 AM, Stephen Connolly wrote:

 also there is a configuration parameter that controls the probability of any 
 read request triggering a read repair
 
 - Stephen
 
 ---
 Sent from my Android phone, so random spelling mistakes, random nonsense 
 words and other nonsense are a direct result of using swype to type on the 
 screen
 
 On 7 Apr 2011 07:35, Stephen Connolly stephen.alan.conno...@gmail.com 
 wrote:
  as I understand, the read repair is a background task triggered by the read
  request, but once the consistency requirement has been met you will be given
  a response.
  
  the coordinator at CL.ONE is allowed to return your response once it has one
  response (empty or not) from any replica. If the first response is empty,
  you get null
  
  - Stephen
  
  ---
  Sent from my Android phone, so random spelling mistakes, random nonsense
  words and other nonsense are a direct result of using swype to type on the
  screen
  On 7 Apr 2011 00:10, Jonathan Colby jonathan.co...@gmail.com wrote:
 
  Let's say you have RF of 3 and a write was written to 2 nodes. 1 was not
  written because the node had a network hiccup (but came back online again).
 
  My question is, if you are reading a key with a CL of ONE, and you happen
  to land on that node that didn't get the write, will the read fail
  immediately?
 
  Or, would read repair check the other replicas and fetch the correct data
  from the other node(s)?
 
  Secondly, is read repair done according to the consistency level, or is
  read repair an independent configuration setting that can be turned on/off.
 
  There was a recent thread about a different variation of my question, but
  went into very technical details, so I didn't want to hijack that thread.



reoccurring exceptions seen

2011-04-07 Thread Jonathan Colby
These types of exceptions are seen sporadically in our cassandra logs.  They 
occur especially after running a repair with nodetool. 

I assume there are a few corrupt rows.   Is this cause for panic?

Will a repair fix this, or is it best to do a decommission + bootstrap via a 
move, for example?   Or would a scrub help here?


ERROR [CompactionExecutor:1] 2011-04-07 15:51:12,093 PrecompactedRow.java (line 
82) Skipping row DecoratedKey(36813508603227779893025154359070714012, 
32326437643439642d623566332d346433392d613334622d343738643433633130383633) in 
/var/lib/cassandra/data/DFS/main-f-164-Data.db
java.io.EOFException
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:383)
at java.io.RandomAccessFile.readFully(RandomAccessFile.java:361)
at 
org.apache.cassandra.io.util.BufferedRandomAccessFile.readBytes(BufferedRandomAccessFile.java:268)
at 
org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:310)
at 
org.apache.cassandra.utils.ByteBufferUtil.readWithLength(ByteBufferUtil.java:267)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
at 
org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
at 
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
at 
org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
at 
org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at 
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at 
org.apache.cassandra.db.CompactionManager.doValidationCompaction(CompactionManager.java:803)
at 
org.apache.cassandra.db.CompactionManager.access$800(CompactionManager.java:56)
at 
org.apache.cassandra.db.CompactionManager$6.call(CompactionManager.java:358)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
ERROR [CompactionExecutor:1] 2011-04-07 15:51:26,356




 INFO [MigrationStage:1] 2011-03-11 17:20:10,900 Migration.java (line 136) 
Applying migration 6f6e2a6c-4bfb-11e0-a3ae-87e4c47e8541 Add keyspace: DFSrep 
factor:2rep 
strategy:NetworkTopologyStrategy{org.apache.cassandra.config.CFMetaData@2a4bd173[cfId=1000,tableName=DFS,cfName=main,cfType=Standard,comparator=org.apache.cassandra.db.marshal.BytesType@c16c2c0,subcolumncomparator=null,c...skipping...
at 
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
at 
org.apache.cassandra.io.sstable.SSTableIdentityIterator.getColumnFamilyWithColumns(SSTableIdentityIterator.java:176)
at 
org.apache.cassandra.io.PrecompactedRow.init(PrecompactedRow.java:78)
at 
org.apache.cassandra.io.CompactionIterator.getCompactedRow(CompactionIterator.java:139)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:108)
at 
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
at 
org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
at 
com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
at 
com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
at 
org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
at 
org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
at 
org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:449)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:124)
at 
org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:94)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   

Re: nodetool move hammers the next node in the ring

2011-04-06 Thread Jonathan Colby
thanks for the response Aaron.   Our cluster has 6 nodes with 10 GB load on 
each.   RF=3.AMD 64 bit Blades, Quad Core, 8 GB ram,  running Debian Linux. 
 Swap off.  Cassandra 0.7.4


On Apr 6, 2011, at 2:40 AM, aaron morton wrote:

 Not that I know of, may be useful to be able to throttle things. But if the 
 receiving node has little head room it may still be overwhelmed.
 
 Currently there is a single thread for streaming. If we were to throttle it 
 may be best to make it multi threaded with a single concurrent stream per end 
 point. 
 
 Out of interest how many nodes do you have and whats the RF?
 
 Aaron
 
 
 On 6 Apr 2011, at 01:16, Jonathan Colby wrote:
 
 
 When doing a move, decommission, loadbalance, etc.  data is streamed to the 
 next node in such a way that it really strains the receiving node - to the 
 point where it has a problem serving requests.   
 
 Any way to throttle the streaming of data?
 



Re: Location-aware replication based on objects' access pattern

2011-04-06 Thread Jonathan Colby
good to see a discussion on this. 

This also has practical use for business continuity, where you can ensure that 
the clients in a given data center first write replicas to their own data center, 
then to the other data center for backup.  If I understand correctly, a write 
takes the token into account first, then the replication strategy decides where 
the replicas go.   I would like to see the first writes based on 
location instead of token, whether that is accomplished by manipulating 
the key or some other mechanism.

That way, if you do suffer the loss of a data center, the clients are 
guaranteed to meet quorum on the nodes in their own data center (given a 
mirrored architecture across 2 data centers).

We have 2 data centers.  If one goes down we have the problem that quorum 
cannot be satisfied for half of the reads.


On Apr 6, 2011, at 6:00 AM, Jonathan Ellis wrote:

 On Tue, Apr 5, 2011 at 10:45 PM, Yudong Gao st...@umich.edu wrote:
 A better solution would be to just push the DecoratedKey into the
 ReplicationStrategy so it can make its decision before information is
 thrown away.
 
 I agree. So in this case, I guess the hashed based token ring is still
 preserved to avoid hot spot, but we further use the DecoratedKey to
 guide the replication strategy. For example, replica 2 is placed in
 the first node along the ring that belongs to the desirable data center 
 (based on the location hint embedded DecoratedKey). But we may not be
 able to control the primary replica. Do you think this will be a
 reasonable design?
 
 calculateNaturalEndpoints has complete freedom to generate all
 replicas any way it likes.  Thinking of an endpoint as primary
 because it was generated first by one algorithm is dangerous.
 
 As one of the docstrings explains, replica destinations (endpoints)
 should be considered a Set even though we use a List for efficiency.
 None of them are special at the ReplicationStrategy level.
 
 Just curious, are they happy with the current
 solution with keyspace, and is there some requests for per-row
 placement control?
 
 Enough people want to try it that we have the ticket open. :)
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



consistency ONE and null

2011-04-06 Thread Jonathan Colby

Let's say you have RF of 3 and a write was written to 2 nodes.  1 was not 
written because the node had a network hiccup (but came back online again).

My question is, if you are reading a key with a CL of ONE,  and you happen to 
land on that node that didn't get the write, will the read fail immediately?

Or, would read repair check the other replicas and fetch the correct data from 
the other node(s)?

Secondly, is read repair done according to the consistency level, or is read 
repair an independent configuration setting that can be turned on/off.

There was a recent thread about a different variation of my question, but went 
into very technical details, so I didn't want to hijack that thread.

Re: Re: nodetool cleanup - results in more disk use?

2011-04-05 Thread jonathan . colby
I think the key thing to remember is that compaction is performed on  
*similar* sized sstables, so it makes sense that over time this will have a  
cascading effect. I think by default it starts out by compacting 4 freshly  
flushed sstables, and then the cycle begins.
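
To make that concrete, here is a rough sketch in Java of the bucketing idea as I 
understand it (not Cassandra's actual code; the 0.5x-1.5x window and the threshold 
of 4 are just the defaults I've read about, so treat them as assumptions):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Rough sketch of minor-compaction bucketing as I understand it; the real
// logic lives in CompactionManager and differs in detail.
public class BucketSketch {

    // Group sstable sizes (in MB here) into buckets of "similar" size: a file
    // joins a bucket if it is within 0.5x .. 1.5x of that bucket's average.
    static List<List<Long>> buckets(List<Long> sizes) {
        List<List<Long>> buckets = new ArrayList<List<Long>>();
        List<Long> sorted = new ArrayList<Long>(sizes);
        Collections.sort(sorted);
        for (Long size : sorted) {
            List<Long> target = null;
            for (List<Long> bucket : buckets) {
                long avg = 0;
                for (Long s : bucket) avg += s;
                avg /= bucket.size();
                if (size >= avg / 2 && size <= avg * 3 / 2) { target = bucket; break; }
            }
            if (target == null) { target = new ArrayList<Long>(); buckets.add(target); }
            target.add(size);
        }
        return buckets;
    }

    public static void main(String[] args) {
        // four freshly flushed memtables plus one big previously-compacted file:
        // only the bucket of small files reaches the default threshold of 4.
        List<Long> sizes = Arrays.asList(64L, 60L, 70L, 66L, 5000L);
        for (List<Long> bucket : buckets(sizes)) {
            System.out.println(bucket + (bucket.size() >= 4 ? "  -> eligible for minor compaction" : ""));
        }
    }
}

The point of the toy run is that the one big file ends up alone in its own bucket, 
which is why it only gets touched again once enough similar-sized files accumulate.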


On Apr 4, 2011 3:42pm, shimi shim...@gmail.com wrote:
The bigger the file, the longer it will take for it to be part of a  
compaction again.  Compacting a bucket of large files takes longer than  
compacting a bucket of small files.




Shimi



On Mon, Apr 4, 2011 at 3:58 PM, aaron morton aa...@thelastpickle.com  
wrote:



mmm, interesting. My theory was



t0 - major compaction runs, there is now one sstable
t1 - x new sstables have been created
t2 - minor compaction runs and determines there are two buckets, one with  
the x new sstables and one with the single big file. The bucket of many  
files is compacted into one, the bucket of one file is ignored.




I can see that it takes longer for the big file to be involved in  
compaction again, and when it finally was it would take more time. But  
that minor compactions of new SSTables would still happen at the same  
rate, especially if they are created at the same rate as previously.





Am I missing something or am I just reading the docs wrong ?




Cheers
Aaron







On 4 Apr 2011, at 22:20, Jonathan Colby wrote:




hi Aaron -


The Datastax documentation brought to light the fact that over time,  
major compactions will be performed on bigger and bigger SSTables. They  
actually recommend against performing too many major compactions. Which  
is why I am wary to trigger too many major compactions ...





http://www.datastax.com/docs/0.7/operations/scheduled_tasks



Performing Major Compaction
A major compaction process merges all SSTables for all column
families in a keyspace – not just similar sized ones, as in minor
compaction. Note that this may create extremely large SStables that
result in long intervals before the next minor compaction (and a
resulting increase in CPU usage for each minor compaction).
Though a major compaction ultimately frees disk space used by
accumulated SSTables, during runtime it can temporarily double disk
space usage. It is best to run major compactions, if at all, at times of
low demand on the cluster.














On Apr 4, 2011, at 1:57 PM, aaron morton wrote:



cleanup reads each SSTable on disk and writes a new file that contains  
the same data with the exception of rows that are no longer in a token  
range the node is a replica for. It's not compacting the files into fewer  
files or purging tombstones. But it is re-writing all the data for the CF.



Part of the process will trigger GC if needed to free up disk space from  
SSTables no longer needed.


AFAIK having fewer bigger files will not cause longer minor compactions.  
Compaction thresholds are applied per bucket of files that share a  
similar size, there is normally more smaller files and fewer larger files.




Aaron



On 2 Apr 2011, at 01:45, Jonathan Colby wrote:


I discovered that a Garbage collection cleans up the unused old SSTables.  
But I still wonder whether cleanup really does a full compaction. This  
would be undesirable if so.





On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:



I ran node cleanup on a node in my cluster and discovered the disk usage  
went from 3.3 GB to 5.4 GB. Why is this?



I thought cleanup just removed hinted handoff information. I read that  
*during* cleanup extra disk space will be used similar to a compaction.  
But I was expecting the disk usage to go back down when it finished.



I hope cleanup doesn't trigger a major compaction. I'd rather not run  
major compactions because it means future minor compactions will take  
longer and use more CPU and disk.


























if nodetool operations abort with timeout, did the operation continue?

2011-04-05 Thread Jonathan Colby

when doing a nodetool move, after about 15 minutes I got the below 
exception.   The cassandra log seems to indicate that the move is still 
ongoing.   Is this anything to worry about?


Exception in thread main java.rmi.UnmarshalException: Error unmarshaling 
return header; nested exception is: 
java.io.EOFException
at 
sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:209)
at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:142)
at com.sun.jmx.remote.internal.PRef.invoke(Unknown Source)
at javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(Unknown 
Source)
at 
javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:993)
at 
javax.management.MBeanServerInvocationHandler.invoke(MBeanServerInvocationHandler.java:288)
at $Proxy0.move(Unknown Source)
at org.apache.cassandra.tools.NodeProbe.move(NodeProbe.java:347)
at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:564)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at 
sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:195)



Disable Swap? batch_mutate failed: out of sequence response

2011-04-05 Thread Jonathan Colby
Hi Jonathan -

Would you recommend disabling system swap as a rule?   I'm running on Debian 
64-bit and am seeing light swapping:

             total       used       free     shared    buffers     cached
Mem:          8003       7969         33          0          0       4254
-/+ buffers/cache:       3714       4288
Swap:          513         15        498




On Apr 5, 2011, at 8:35 PM, Jonathan Ellis wrote:

 Step 1: disable swap.
 
 2011/4/5 Héctor Izquierdo Seliva izquie...@strands.com:
 Update with more info:
 
 I'm still running into problems. Now I don't write more than 100 columns
 at a time, and I'm having lots of Stop-the-world gc pauses.
 
 I'm writing into three column families, with memtable_operations = 0.3
 and memtable_throughput = 64. There is now swapping, and full GCs are taking 
 around 5 seconds. I'm running cassandra with a heap of 8 GB. Should I tune 
 this somehow?
 
 Is any of this wrong?
 
 -Original Message-
 From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com]
 Sent: April-05-11 8:30
 To: user@cassandra.apache.org
 Subject: batch_mutate failed: out of sequence response
 
 Hi everyone. I'm having trouble while inserting big amounts of data into
 cassandra. I'm getting this exception:
 
 batch_mutate failed: out of sequence response
 
 I'm gessing is due to very big mutates. I have made the batch mutates
 smaller and it seems to be behaving. Can somebody shed some light?
 
 Thanks!
 
 
 
 
 
 
 
 
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



extreme memory consumption

2011-04-05 Thread Jonathan Colby
I've seen the other posts about memory consumption, but I'm seeing some weird 
behavior with 0.7.4   with 5 GB heap size   (64 bit system with 8 GB ram 
total)...

note the virtual mem used 20.6 GB ?!   and Shared 8.4 GB ?!

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2390 root      20   0     0    0    0 D    1  0.0   0:28.73 flush-104:0
31684 cassandr  20   0 20.6g 3.5g 8496 S    1 45.4   4:08.91 java
   17 root      20   0     0    0    0 S    0  0.0   0:38.03 events/2

What could be going on here?


config 

initial_token: 
auto_bootstrap: true
hinted_handoff_enabled: true
max_hint_window_in_ms: 360 # one hour
hinted_handoff_throttle_delay_in_ms: 50
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authority: org.apache.cassandra.auth.AllowAllAuthority
partitioner: org.apache.cassandra.dht.RandomPartitioner
data_file_directories:
- /var/lib/cassandra/data
commitlog_directory: /var/lib/cassandra/commitlog
saved_caches_directory: /var/lib/cassandra/saved_caches
commitlog_rotation_threshold_in_mb: 128
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
disk_access_mode: auto
concurrent_reads: 16
concurrent_writes: 32
sliced_buffer_size_in_kb: 64
storage_port: 7000

rpc_port: 9160
rpc_keepalive: true
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
snapshot_before_compaction: false
binary_memtable_throughput_in_mb: 256
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
rpc_timeout_in_ms: 1
endpoint_snitch: org.apache.cassandra.locator.RackInferringSnitch
dynamic_snitch: true
dynamic_snitch_update_interval_in_ms: 100 
dynamic_snitch_reset_interval_in_ms: 60
dynamic_snitch_badness_threshold: 0.0
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
keyspaces:
- name: DFS
  replica_placement_strategy: 
org.apache.cassandra.locator.OldNetworkTopologyStrategy
  replication_factor: 3
  column_families:
- name: main
  compare_with: BytesType
  keys_cached: 20
  rows_cached: 200
  row_cache_save_period_in_seconds: 0
  key_cache_save_period_in_seconds: 3600

nothing happening in the cluster after a nodetool move

2011-04-05 Thread Jonathan Colby

I added a node to the cluster and I am having a difficult time reassigning the 
new tokens.

It seems after a while nothing shows up in the new node's logs and it just 
stays in status Leaving.   nodetool netstats   on all nodes shows Nothing 
streaming to/from.

There is no activity in the other logs related to the move. 

The data size is not even that big, around 5 GB.  What could be happening?   
Seems like the move is frozen.

Update: Re: nothing happening in the cluster after a nodetool move

2011-04-05 Thread Jonathan Colby
Well, since my last post,  about 10 minutes later, the node goes into bootstrap 
mode.  It's kind of worrying that a lot of time goes by where it seems like 
nothing is happening, then all of a sudden things get going again.


22,584 keys.  Time: 20,276ms.
 INFO [HintedHandoff:1] 2011-04-05 22:29:23,167 HintedHandOffManager.java (line 
304) Started hinted handoff for endpoint /10.46.108.101
 INFO [HintedHandoff:1] 2011-04-05 22:29:23,167 HintedHandOffManager.java (line 
360) Finished hinted handoff of 0 rows to endpoint /10.46.108.101

 LONG PAUSE WHERE NOTHING HAPPENS 


 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:43:38,770 
StorageService.java (line 1637) Announcing that I have left the ring for 3ms
 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,770 
StorageService.java (line 1747) re-bootstrapping to new token 
85070591730234615865843651857942052863
 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,771 
ColumnFamilyStore.java (line 695) switching in a fresh Memtable for 
LocationInfo at 
CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1302035265949.log',
 position=25920946)
 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,771 
ColumnFamilyStore.java (line 1006) Enqueuing flush of 
Memtable-LocationInfo@1358281533(53 bytes, 2 operations)
 INFO [FlushWriter:1] 2011-04-05 22:44:08,772 Memtable.java (line 157) Writing 
Memtable-LocationInfo@1358281533(53 bytes, 2 operations)
 INFO [FlushWriter:1] 2011-04-05 22:44:08,825 Memtable.java (line 164) 
Completed flushing /var/lib/cassandra/data/system/LocationInfo-f-22-Data.db 
(163 bytes)
 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:08,826 
StorageService.java (line 505) Joining: sleeping 3 ms for pending range 
setup
 INFO [RMI TCP Connection(4)-10.46.108.102] 2011-04-05 22:44:38,826 
StorageService.java (line 505) Bootstrapping
 INFO [CompactionExecutor:1] 2011-04-05 22:44:43,952 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-128
 INFO [CompactionExecutor:1] 2011-04-05 22:44:43,978 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-129
 INFO [CompactionExecutor:1] 2011-04-05 22:46:02,228 SSTableReader.java (line 
154) Opening /var/lib/cassandra/data/DFS/main-f-130

On Apr 5, 2011, at 10:46 PM, Jonathan Colby wrote:

 
 I added a node to the cluster and I am having a difficult time reassigning 
 the new tokens.
 
 It seems after a while nothing shows up in the new node's logs and it just 
 stays in status Leaving.   nodetool netstats   on all nodes shows 
 Nothing streaming to/from.
 
 There is no activity in the other logs related to the move. 
 
 The data size is not even that big, around 5 GB.  What could be happening?  
  Seems like the move is frozen.



Re: nodetool cleanup - results in more disk use?

2011-04-04 Thread Jonathan Colby
hi Aaron -

The Datastax documentation brought to light the fact that over time, major 
compactions  will be performed on bigger and bigger SSTables.   They actually 
recommend against performing too many major compactions.  Which is why I am 
wary to trigger too many major compactions ...

http://www.datastax.com/docs/0.7/operations/scheduled_tasks
Performing Major Compaction

A major compaction process merges all SSTables for all column families in a 
keyspace – not just similar sized ones, as in minor compaction. Note that this 
may create extremely large SStables that result in long intervals before the 
next minor compaction (and a resulting increase in CPU usage for each minor 
compaction).

Though a major compaction ultimately frees disk space used by accumulated 
SSTables, during runtime it can temporarily double disk space usage. It is best 
to run major compactions, if at all, at times of low demand on the cluster.







On Apr 4, 2011, at 1:57 PM, aaron morton wrote:

 cleanup reads each SSTable on disk and writes a new file that contains the 
 same data with the exception of rows that are no longer in a token range the 
 node is a replica for. It's not compacting the files into fewer files or 
 purging tombstones. But it is re-writing all the data for the CF. 
 
 Part of the process will trigger GC if needed to free up disk space from 
 SSTables no longer needed.
 
 AFAIK having fewer bigger files will not cause longer minor compactions. 
 Compaction thresholds are applied per bucket of files that share a similar 
 size, there is normally more smaller files and fewer larger files. 
 
 Aaron
 
 On 2 Apr 2011, at 01:45, Jonathan Colby wrote:
 
 I discovered that a Garbage collection cleans up the unused old SSTables.   
 But I still wonder whether cleanup really does a full compaction.  This 
 would be undesirable if so.
 
 
 On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:
 
 I ran node cleanup on a node in my cluster and discovered the disk usage 
 went from 3.3 GB to 5.4 GB.  Why is this?
 
 I thought cleanup just removed hinted handoff information.   I read that 
 *during* cleanup extra disk space will be used similar to a compaction.  
 But I was expecting the disk usage to go back down when it finished.
 
 I hope cleanup doesn't trigger a major compaction.  I'd rather not run 
 major compactions because it means future minor compactions will take 
 longer and use more CPU and disk.
 
 
 
 



Re: changing replication strategy and effects on replica nodes

2011-04-01 Thread Jonathan Colby
Hi Aaron -  Yes, I've read the part about changing the replication factor on a 
running cluster.  I've even done it without a problem.  The real point of my 
question was 

 do you now have unused replica data on the old replica nodes that you need 
 to clean up manually?

any insight would be appreciated.

On Apr 1, 2011, at 1:45 PM, aaron morton wrote:

 See the section on Replication here 
 http://wiki.apache.org/cassandra/Operations#Replication It talks about how to 
 change the RF and then says you can do the same when change the placement 
 strategy. 
 
 It can be done, but is a little messy. 
 
 Depending on your setup it may also be possible to copy / move the nodes 
 manually by moving sstable files.  
 
 I've not done it myself, are you able to run a test ?
 
 Hope that helps. 
 Aaron
 
 On 1 Apr 2011, at 02:04, Jonathan Colby wrote:
 
 
 From my understanding of replica copies,  cassandra picks which nodes to 
 replicate the data based on replication strategy, and those same replica 
 partner nodes are always used according to token ring distribution.
 
 If you change the replication strategy,  does cassandra pick new nodes to 
 replicate to?   (for example if you went from simple strategy to a 
 networkTopology strategy where copies are to be sent to another datacenter)
 
 If so,  do you now have unused replica data on the old replica nodes that 
 you need to clean up manually?
 



nodetool cleanup - results in more disk use?

2011-04-01 Thread Jonathan Colby
I ran node cleanup on a node in my cluster and discovered the disk usage went 
from 3.3 GB to 5.4 GB.  Why is this?

I thought cleanup just removed hinted handoff information.   I read that 
*during* cleanup extra disk space will be used similar to a compaction.  But I 
was expecting the disk usage to go back down when it finished.

I hope cleanup doesn't trigger a major compaction.  I'd rather not run major 
compactions because it means future minor compactions will take longer and use 
more CPU and disk.




Re: nodetool cleanup - results in more disk use?

2011-04-01 Thread Jonathan Colby
I discovered that a Garbage collection cleans up the unused old SSTables.   But 
I still wonder whether cleanup really does a full compaction.  This would be 
undesirable if so.
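
For what it's worth, that GC can also be forced over JMX instead of waiting for the 
JVM to decide (which is what jconsole does).  A minimal sketch; the 
java.lang:type=Memory bean and its gc operation are standard JVM MBeans, but the 
host and the 8080 JMX port are assumptions from my 0.7 setup:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Force a GC on a remote Cassandra JVM so obsolete, compacted sstables get
// unlinked. The Memory MBean is standard; only the host and port are assumptions.
public class ForceGc {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            mbs.invoke(new ObjectName("java.lang:type=Memory"), "gc", null, null);
            System.out.println("Requested a full GC");
        } finally {
            connector.close();
        }
    }
}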


On Apr 1, 2011, at 4:08 PM, Jonathan Colby wrote:

 I ran node cleanup on a node in my cluster and discovered the disk usage went 
 from 3.3 GB to 5.4 GB.  Why is this?
 
 I thought cleanup just removed hinted handoff information.   I read that 
 *during* cleanup extra disk space will be used similar to a compaction.  But 
 I was expecting the disk usage to go back down when it finished.
 
 I hope cleanup doesn't trigger a major compaction.  I'd rather not run major 
 compactions because it means future minor compactions will take longer and 
 use more CPU and disk.
 
 



changing replication strategy and effects on replica nodes

2011-03-31 Thread Jonathan Colby

From my understanding of replica copies,  cassandra picks which nodes to 
replicate the data based on replication strategy, and those same replica 
partner nodes are always used according to token ring distribution.

If you change the replication strategy,  does cassandra pick new nodes to 
replicate to?   (for example if you went from simple strategy to a 
networkTopology strategy where copies are to be sent to another datacenter)

If so,  do you now have unused replica data on the old replica nodes that you 
need to clean up manually?

Re: How to determine if repair need to be run

2011-03-31 Thread Jonathan Colby
silly question, would every cassandra installation need to have manual repairs 
done on it?

It would seem cassandra's read repair and regular compaction would take care 
of keeping the data clean. 

Am I missing something?
 

On Mar 30, 2011, at 7:46 PM, Peter Schuller wrote:

 I just wanted to chime in here and say some people NEVER run repair.
 
 Just so long as the OP is understanding that this implies taking an
 explicit decision to accept the misbehavior you will see as a
 result. I.e., the reason people survive not doing repairs in some
 cases is, as in your case, that they can actually live with the
 consequences such as old data magically re-appearing permanently.
 
 as it really increased on disk data. I have followed some threads and
 there are some conditions that I read repair can't handle. The
 
 For one thing, RR will only touch data that is read. And not even all
 data that is read at that (e.g. range slices don't imply repair).
 
 -- 
 / Peter Schuller



Re: How to determine if repair need to be run

2011-03-31 Thread Jonathan Colby
Peter -

Thanks a lot for elaborating on repairs.  Still, it's a bit fuzzy to me why 
it is so important to run a repair before GCGraceSeconds kicks in.   Does 
this mean a delete does not get replicated?   In other words, when I delete 
something on a node, doesn't cassandra set tombstones on its replica copies?

And technically, isn't repair only needed for cases where things weren't 
properly propagated in the cluster?  If all writes are written to the right 
replicas, and all deletes are written to all the replicas, and all nodes were 
available at all times, then everything should work as designed - without 
manual intervention, right?

Thanks again.



On Mar 31, 2011, at 6:17 PM, Peter Schuller wrote:

 silly question, would every cassandra installation need to have manual 
 repairs done on it?
 
 It would seem cassandra's read repair and regular compaction would take 
 care of keeping the data clean.
 
 Am I missing something?
 
 See my previous posts in this thread for the distinct reasons to run
 repair. Except in special circumstances where you know exactly what
 you're doing (mainly that no deletes are performed), you are
  *required* to run repair often enough for GCGraceSeconds:
 
   http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
 
 It seems that there needs to be some more elaborate documentation
 about this somewhere to point to since there seems to be confusion.
 
 Regular compaction does *not* imply repair. Read repair only works if
 (1) you touch all data within GCGraceSeconds, and (2) you touch it in
 such a way that read repair is enabled (e.g., not range scans), and
 (3) no node ever happens to be down, flap, or drop a request when you
 touch the data in question.
 
 Basically, unless you are really sure what you're doing - run repair.
 
 -- 
 / Peter Schuller



difference between compaction, repair, clean

2011-03-30 Thread Jonathan Colby
I'm a little unclear on the differences between the nodetool operations:

- compaction
- repair 
- clean

I understand that compaction consolidates the SSTables and physically performs 
deletes by taking the tombstones into account.  But what do clean and repair 
do then?





Re: Central monitoring of Cassandra cluster

2011-03-25 Thread Jonathan Colby
Cacti and Munin are great for graphing, nagios is good for monitoring.

I wrote a very simple JMX proxy that you can send a request to and it retrieves 
the desired JMX beans.

There are JMX proxies out there if you don't want to write your own, for example 
http://code.google.com/p/polarrose-jmx-rest-bridge/

There is even a JMX proxy that integrates with Nagios.  I don't remember the 
name but Google will help you.
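
The JMX part itself is only a few lines of standard javax.management code if you 
want to roll your own.  A minimal sketch; treat the 8080 port and the exact 
ObjectName as assumptions from my 0.7 setup and verify them in jconsole first:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Minimal JMX poll of a Cassandra node; 8080 was the default JMX port on 0.7,
// and the ObjectName below is what I remember for per-CF beans -- verify both.
public class CassandraJmxPoll {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName cf = new ObjectName(
                    "org.apache.cassandra.db:type=ColumnFamilies,keyspace=DFS,columnfamily=main");
            System.out.println("ReadCount  = " + mbs.getAttribute(cf, "ReadCount"));
            System.out.println("WriteCount = " + mbs.getAttribute(cf, "WriteCount"));
        } finally {
            connector.close();
        }
    }
}

From there it is easy to expose whichever attributes you care about over HTTP for 
Cacti or Nagios to scrape.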



On Mar 24, 2011, at 7:44 PM, mcasandra wrote:

 Can someone share if they have centralized monitoring for all cassandra
 servers. With many nodes it becomes difficult to monitor them individually
 unless we can look at data in one place. I am looking at solutions where
 this can be done. Looking at Cacti currently but not sure how to integrate
 it with JMX.
 
 --
 View this message in context: 
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Central-monitoring-of-Cassandra-cluster-tp6205275p6205275.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
 Nabble.com.



how does cassandra pick its replicant peers?

2011-03-25 Thread Jonathan Colby

Does anyone know how cassandra chooses the nodes for its other replicant copies?

The first node gets the first copy because its token is assigned for that key.  
 But what about the other copies of the data?

Do the replicant nodes stay the same based on the token range?  Or are the 
other copies sent to any random node based on its load and availability?

I think this is important in order to understand because it affects how to plan 
for situations where a significant number of nodes are suddenly unavailable, 
such as the loss of a data center.  

If the replicants are copied just based on random availability, then quorum 
writes could survive on the remaining nodes.  But if the replicant nodes are 
somehow pre-determined, those replicants may not be available and writes will 
fail.
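
My current understanding, which I'd like confirmed, is that with the simple 
rack-unaware strategy the other copies are not random at all: the key hashes to a 
token, the node owning that token range takes the first copy, and the next RF-1 
distinct nodes walking clockwise around the ring take the rest (the topology-aware 
strategies then add rack and data-center rules on top).  A toy sketch of that 
mental model, definitely not Cassandra's actual code:

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of rack-unaware replica placement: first replica goes to the node
// owning the key's token, the rest to the next distinct nodes clockwise.
// This is only my mental model, not Cassandra's code.
public class ReplicaSketch {
    static List<String> replicasFor(BigInteger keyToken,
                                    TreeMap<BigInteger, String> ring, int rf) {
        List<String> replicas = new ArrayList<String>();
        // start at the first node token >= the key's token, wrapping around if needed
        SortedMap<BigInteger, String> tail = ring.tailMap(keyToken);
        for (String node : tail.values()) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        for (String node : ring.values()) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();
        ring.put(new BigInteger("0"), "node1");
        ring.put(new BigInteger("42535295865117307932921825928971026432"), "node2");
        ring.put(new BigInteger("85070591730234615865843651857942052864"), "node3");
        ring.put(new BigInteger("127605887595351923798765477786913079296"), "node4");
        // a key hashing just past node3's token lands on node4, then wraps to node1
        BigInteger keyToken = new BigInteger("90000000000000000000000000000000000000");
        System.out.println(replicasFor(keyToken, ring, 3)); // [node4, node1, node2]
    }
}

If that is right, the replica set for a given key is indeed pre-determined by the 
token layout, which is exactly why the data-center placement question matters.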





Quorum, Hector, and datacenter preference

2011-03-24 Thread Jonathan Colby
Hi -

Our cluster is spread between 2 datacenters.   We have a straightforward IP 
assignment so that OldNetworkTopology (rack-inferring snitch) works well.  We 
have cassandra clients written with Hector in each of those data centers.   The 
Hector clients all have a list of all cassandra nodes across both data centers. 
 RF=3.

Is there an order as to which data center gets the first write?In other 
words, would (or can) the Hector client do its first write to the cassandra 
nodes in its own data center?

It would be ideal if Hector chose the local cassandra nodes.  That way, if 
one data center is unreachable, the Quorum of replicas in cassandra is still 
reached (because it was written to the working data center first).

Otherwise, if the cassandra writes are really random from the Hector client 
point-of-view, a data center outage would result in a read failure for any data 
that has 2 replicas in the lost data center.

Is anyone doing this?  Is there a flaw in my logic?




Re: Quorum, Hector, and datacenter preference

2011-03-24 Thread Jonathan Colby
Indeed I found the big flaw in my own logic.   Even writing to the local 
cassandra nodes does not guarantee where the replicas will end up.   The 
decision where to write the first replica is based on the token ring, which is 
spread out on all nodes regardless of datacenter.   Right?

On Mar 24, 2011, at 2:02 PM, Jonathan Colby wrote:

 Hi -
 
 Our cluster is spread between 2 datacenters.   We have a straight-forward IP 
 assignment so that OldNetworkTopology (rackinferring snitch) works well.
 We have cassandra clients written in Hector in each of those data centers.   
 The Hector clients all have a list of all cassandra nodes across both data 
 centers.  RF=3.
 
 Is there an order as to which data center gets the first write?In other 
 words, would (or can) the Hector client do its first write to the cassandra 
 nodes in its own data center?
 
 It would be ideal if Hector chose the local cassandra nodes.  That way, if 
 one data center is unreachable, the Quorum of replicas in cassandra is still 
 reached (because it was written to the working data center first).
 
 Otherwise, if the cassandra writes are really random from the Hector client 
 point-of-view, a data center outage would result in a read failure for any 
 data that has 2 replicas in the lost data center.
 
 Is anyone doing this?  Is there a flaw in my logic?
 
 



Deleting old SSTables

2011-03-22 Thread Jonathan Colby
According to the Wiki Page on compaction:  once compaction is finished, the old 
SSTable files may be deleted*

* http://wiki.apache.org/cassandra/MemtableSSTable

I thought the old SSTables would be deleted automatically, but this wiki page 
got me thinking otherwise.

Question is,  if it is true that old SSTables must be manually deleted, how can 
one safely identify which SSTables can be deleted??

Jon







Changing memtable_throughput_in_mb on a running system

2011-03-22 Thread Jonathan Colby
It seems some settings like memtable_throughput_in_mb  are Keyspace-specific 
(at least with 0.7.4).

How can these settings best be changed on a running cluster?

PS - preferable by a sysadmin using nodetool or cassandra-cli

Thanks!
Jon

Re: Deleting old SSTables

2011-03-22 Thread Jonathan Colby
doooh.  thanks!
On Mar 22, 2011, at 3:27 PM, Jonathan Ellis wrote:

 From the next paragraph of the same wiki page:
 
 SSTables that are obsoleted by a compaction are deleted asynchronously
 when the JVM performs a GC. You can force a GC from jconsole if
 necessary, but Cassandra will force one itself if it detects that it
 is low on space. A compaction marker is also added to obsolete
 sstables so they can be deleted on startup if the server does not
 perform a GC before being restarted.
 
 On Tue, Mar 22, 2011 at 8:30 AM, Jonathan Colby
 jonathan.co...@gmail.com wrote:
 According to the Wiki Page on compaction:  once compaction is finished, the 
 old SSTable files may be deleted*
 
 * http://wiki.apache.org/cassandra/MemtableSSTable
 
 I thought the old SSTables would be deleted automatically, but this wiki 
 page got me thinking otherwise.
 
 Question is,  if it is true that old SSTables must be manually deleted, how 
 can one safely identify which SSTables can be deleted??
 
 Jon
 
 
 
 
 
 
 
 
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com



Meaning of TotalReadLatencyMicros and TotalWriteLatencyMicros Statistics

2011-03-22 Thread Jonathan Colby
Hi -

On our recently live cassandra cluster of 5 nodes, we've noticed that the 
latency readings, especially reads, have gone up drastically. 

TotalReadLatencyMicros  5413483
TotalWriteLatencyMicros 1811824


I understand these are in microseconds, but what meaning do they have for the 
performance of the cluster?   In other words, what do these numbers actually 
measure?

In our case, it looks like we have  a read latency of 5.4 seconds, which is 
very troubling if I interpret this correctly.

Are reads really taking an average of 5 seconds to complete??
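
My working assumption, and please correct me if it's wrong, is that these Total 
counters are running sums over all requests, so the per-read average would be the 
total divided by ReadCount from the same bean rather than the raw total.  A toy 
calculation with a made-up read count:

// Toy interpretation: if TotalReadLatencyMicros is a cumulative sum, then the
// average per read is total / count. The read count below is hypothetical.
public class LatencyMath {
    public static void main(String[] args) {
        long totalReadLatencyMicros = 5413483L; // the value reported above
        long readCount = 1000L;                 // hypothetical ReadCount from the same bean
        double avgMillis = (totalReadLatencyMicros / (double) readCount) / 1000.0;
        System.out.printf("average read latency: %.2f ms%n", avgMillis);
    }
}

With 1000 reads (again, a made-up number), the same 5.4 seconds total would only be 
about 5.4 ms per read, so I'd like to confirm which interpretation is correct.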





cassandra nodes with mixed hard disk sizes

2011-03-21 Thread Jonathan Colby

This is a two part question ...

1. If you have cassandra nodes with different sized hard disks,  how do you 
deal with assigning the token ring such that the nodes with larger disks get 
more data?   In other words, given equally distributed token ranges, when the 
smaller disk nodes run out of space, the larger disk nodes will still have 
unused capacity.    Or is installing a mixed hardware cluster a no-no?
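
The only scheme I've come up with so far is weighting the token assignments by disk 
capacity, something like the toy calculation below (the capacities are made up, and 
this only balances ownership of the ring, not the actual size of the rows that land 
in each range):

import java.math.BigInteger;

// Toy token assignment for RandomPartitioner: give each node a slice of the
// 0..2^127 ring proportional to its disk capacity. Capacities are hypothetical.
public class WeightedTokens {
    public static void main(String[] args) {
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        String[] nodes =      {"node1", "node2", "node3", "node4"};
        long[]   capacityGb = { 500,     500,     1000,    2000 };

        long total = 0;
        for (long c : capacityGb) total += c;

        long cumulative = 0;
        for (int i = 0; i < nodes.length; i++) {
            cumulative += capacityGb[i];
            // node i's token is the upper end of its slice of the ring
            BigInteger token = ringSize
                    .multiply(BigInteger.valueOf(cumulative))
                    .divide(BigInteger.valueOf(total));
            System.out.println(nodes[i] + ": " + token);
        }
    }
}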

2. What happens when a cassandra node runs out of disk space for its data 
files?  Does it continue serving the data while not accepting new data?  Or 
does the node break and require manual intervention?

This info has eluded me elsewhere.
Jon

Re: script to modify cassandra.yaml file

2011-03-21 Thread Jonathan Colby
We use Puppet to manage the cassandra.yaml in a different location from the 
installation.   Ours is in /etc/cassandra/cassandra.yaml 

You can set the environment variable CASSANDRA_CONF (I believe that's the name; check 
cassandra.in.sh) and the startup script will pick it up as the configuration 
location to use.

With Puppet you can manage the list of seeds, set the IP addresses, etc 
dynamically.   I even use it to set the initial tokens.  It makes life a lot 
easier.



On Mar 21, 2011, at 9:14 AM, Sasha Dolgy wrote:

 I use grep / awk / sed from within a bash script ... this works quite well.
 -sd
 
 On Mon, Mar 21, 2011 at 12:39 AM, Anurag Gujral anurag.guj...@gmail.com 
 wrote:
 Hi All,
   I want to modify the values in the cassandra.yaml which comes with
 the cassandra-0.7 package based on values of hostnames,
 colo etc.
 Does someone knows of some script which I can use which reads in default
 cassandra.yaml and write outs new cassandra.yaml
 with values based on number of nodes in the cluster ,hostname,colo name etc.



Replacing a dead seed

2011-03-17 Thread Jonathan Colby
Hi - 

If a seed crashes (i.e., suddenly unavailable due to HW problem),   what is the 
best way to replace the seed in the cluster?

I've read that you should not bootstrap a seed.  Therefore I came up with this 
procedure, but it seems pretty complicated.  Any better ideas?
 
1. update the seed list on all nodes, taking out the dead node  and restart the 
nodes in the  cluster so the new seed list is updated
2. then bootstrap the new (replacement ) node as a normal node  (not yet as a 
seed)
3. when bootstrapping is done, make the new node a seed.
4. update the seed list again adding back the replacement seed (and rolling 
restart the cluster as in step 1)


That seems to me like a whole lot of work.  Surely there is a better way?

Jon

OldNetworkTopologyStrategy with one data center

2011-03-15 Thread Jonathan Colby
Hi -

I have a question. Obviously there is no purpose in running
OldNetworkTopologyStrategy in one data center.  However,  we want to
share the same configuration in our production (multiple data centers)
and pre-production (one data center) environments.

My question is will
org.apache.cassandra.locator.OldNetworkTopologyStrategy function with
one data center and RackInferringSnitch?

Jon


where to find the stress testing programs?

2011-03-15 Thread Jonathan Colby
According to the Cassandra wiki and the O'Reilly book, there is supposedly a 
contrib directory within the cassandra download containing the
Python Stress Test script stress.py.  It's not in the binary tarball
of 0.7.3.

Anyone know where to find it?

Anyone know of other, maybe better stress testing scripts?

Jon


Re: Virtual IP / hardware load balancing for cassandra nodes

2010-12-20 Thread Jonathan Colby
Thanks guys.   
On Dec 20, 2010, at 5:44 PM, Dave Viner wrote:

 You can put a Cassandra cluster behind a load balancer.  One thing to be 
 cautious of is the health check.  Just because the node is listening on port 
 9160 doesn't mean that it's healthy to serve requests.  It is required, but 
 not sufficient.
 
 The real test is the JMX values.  
 
 Dave Viner
 
 
 On Mon, Dec 20, 2010 at 6:25 AM, Jonathan Colby jonathan.co...@gmail.com 
 wrote:
 I was unable to find an example or documentation on my question.  I'd like to 
 know the best way to group a cluster of cassandra nodes behind a virtual 
 IP.
 
 For example, can cassandra nodes be placed behind a Citrix Netscaler hardware 
 load balancer?
 
 I can't imagine it being a problem, but in doing so would you break any 
 cassandra functionality?
 
 The goal is to have the application talk to a single virtual ip  and be 
 directed to a random node in the cluster.
 
 I heard a little about adding the node addresses to Hector's load-balancing 
 mechanism, but this doesn't seem too robust or easy to maintain.
 
 Thanks in advance.
 



Quorum and Datacenter loss

2010-12-12 Thread Jonathan Colby
Hi cassandra experts -

We're planning a cassandra cluster across 2 datacenters
(datacenter-aware, random partitioning) with QUORUM consistency.

It seems to me that with 2 datacenters, if one datacenter is lost,
the  reads/writes to cassandra  will fail in the surviving datacenter
because of the N/2 + 1 distribution of replicas.  In other words, you
need more than half of the replicas to respond but in the case of a
datacenter loss you would only ever get 1/2 to respond at best.
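
To put numbers on the counting argument (nothing Cassandra-specific, just the 
arithmetic I keep doing):

// Quick arithmetic behind the worry: with replicas mirrored across two data
// centers, the surviving DC can be left with only floor(RF/2) copies, which is
// never enough for QUORUM (RF/2 + 1). Just a sketch of the counting argument.
public class QuorumMath {
    public static void main(String[] args) {
        for (int rf = 2; rf <= 6; rf++) {
            int quorum = rf / 2 + 1;
            int worstCaseSurvivors = rf / 2; // the other replicas were in the lost DC
            System.out.printf("RF=%d  quorum=%d  replicas left after DC loss=%d  quorum still possible=%b%n",
                    rf, quorum, worstCaseSurvivors, worstCaseSurvivors >= quorum);
        }
    }
}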

Is my logic wrong here?  Is there a way to ensure the nodes in the
alive datacenter respond successfully if the second datacenter is
lost?  Anyone have experience with this kind of problem?

Thanks.


Re: Quorum and Datacenter loss

2010-12-12 Thread Jonathan Colby
Thanks a lot Peter.   So basically we would need to choose a
consistency other than QUORUM.  I think in our case consistency is 
not necessarily an issue since our data is write-once, read-many
(immutable data).   I suppose having a replication factor of 4 would
result in two nodes in each datacenter having a copy of the data.   If
there's a flaw in my logic, please let me know : ]

On Sun, Dec 12, 2010 at 2:04 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 Is my logic wrong here?  Is there a way to ensure the nodes in the
 alive datacenter respond successfully if the second datacenter is
 lost?  Anyone have experience with this kind of problem?

 It's impossible to achieve the consistency and availability at the
 same time. See:

 (Assuming partition tolerance)

 Anyways, to expand a bit: The final consequence is that if you have a
 cluster that really does need QUORUM consistency, you won't be able to
 survive (in terms of availability, i.e., the cluster serving your
 traffic) data centers going down. If you want to continue operating in
 the case of a partition, you (1) cannot use QUORUM and (2) your
 application must be designed to work with and survive seeing
 inconsistent data.

 --
 / Peter Schuller



understanding the cassandra storage scaling

2010-12-09 Thread Jonathan Colby
I have a very basic question which I have been unable to find in
online documentation on cassandra.

It seems like every node in a cassandra cluster contains all the data
ever stored in the cluster (i.e., all nodes are identical).  I don't
understand how you can scale this on commodity servers with merely
internal hard disks.   In other words, if I want to store 5 TB of
data, does each node need a hard disk capacity of 5 TB??

With HBase, memcached and other nosql solutions it is clearer how 
data is split up in the cluster and replicated for fault tolerance. 
Again, please excuse the rather basic question.


Re: understanding the cassandra storage scaling

2010-12-09 Thread Jonathan Colby
Thanks Ran.  This helps a little but unfortunately it's still a bit 
fuzzy for me.  So is it not true that each node contains all the data 
in the cluster? I haven't come across any information on how clustered 
data is coordinated in cassandra.  How does my query get directed to 
the right node?

On Thu, Dec 9, 2010 at 11:35 AM, Ran Tavory ran...@gmail.com wrote:
 there are two numbers to look at, N the number of hosts in the ring 
 (cluster) and R the number of replicas for each data item. R is configurable 
 per column family. 
 Typically for large clusters N > R. For very small clusters it makes sense 
 for R to be close to N, in which case cassandra is useful so the database 
 doesn't have a single point of failure but not so much b/c of the 
 size of the data. But for large clusters it rarely makes sense to have N=R, 
 usually N > R.

 On Thu, Dec 9, 2010 at 12:28 PM, Jonathan Colby jonathan.co...@gmail.com
 wrote:

 I have a very basic question which I have been unable to find in
 online documentation on cassandra.

 It seems like every node in a cassandra cluster contains all the data
 ever stored in the cluster (i.e., all nodes are identical).  I don't
 understand how you can scale this on commodity servers with merely
 internal hard disks.   In other words, if I want to store 5 TB of
 data, does each node need a hard disk capacity of 5 TB??

 With HBase, memcached and other nosql solutions it is more clear how
 data is split up in the cluster and replicated for fault tolerance.
 Again, please excuse the rather basic question.



 --
 /Ran



Re: understanding the cassandra storage scaling

2010-12-09 Thread Jonathan Colby
awesome!  Thank you guys for the really quick answers and the links to
the presentations.

On Thu, Dec 9, 2010 at 12:06 PM, Sylvain Lebresne sylv...@yakaz.com wrote:
 This helps a little but unfortunately I'm still a bit fuzzy for me.  So is it
 not true that each node contains all the data in the cluster?

 Not at all. Basically each node is responsible for only a part of the data (a
 range really). But for each piece of data you can choose how many nodes it is
 on; this is the Replication Factor.

 For instance, if you choose to have RF=1, then each piece of data will be on
 exactly one node (this is usually a bad idea since it offers very weak
 durability guarantees but nevertheless, it can be done).

 If you choose RF=3, each piece of data is on 3 nodes (independently of the
 number of nodes your cluster has). You can have all data on all nodes, but for
 that you'll have to choose RF=#{nodes in the cluster}. But this is a very
 degenerate case.

 how does my query get directed to the right node?

 Each node in the cluster knows the ranges of data each other nodes hold. I
 suggest you watch the first video linked in this page
  http://wiki.apache.org/cassandra/ArticlesAndPresentations
 It explains this and more.

 --
 Sylvain