Re: nodetool repair caused high disk space usage

2011-08-23 Thread Héctor Izquierdo Seliva
On Sat, 20-08-2011 at 01:22 +0200, Peter Schuller wrote:
  Is there any chance that the entire file from the source node got streamed
  to the destination node even though only a small amount of data in the file
  from the source node is supposed to be streamed to the destination node?
 
 Yes, but the thing that's annoying me is that even if so, you should not
 be seeing a 40 GB to hundreds of GB increase even if all neighbors sent
 all their data.
 

I'm having the very same issue. Trying to repair a node with 90GB of
data fills up the 1.5TB drive, and it's still trying to send more data.
This is on 0.8.1.




nodetool repair mykeyspace mycolumnfamily repairs all the keyspace

2011-07-19 Thread Héctor Izquierdo Seliva
Hi all,

Maybe I'm doing something wrong, but calling ./nodetool -h host repair
mykeyspace mycolumnfamily should only repair mycolumnfamily, right?
Every time I try a repair it repairs the whole keyspace instead of just
one column family. I'm on Cassandra 0.8.1.



Re: nodetool repair mykeyspace mycolumnfamily repairs all the keyspace

2011-07-19 Thread Héctor Izquierdo Seliva
Are there any plans to backport this to 0.8?

On Tue, 19-07-2011 at 11:43 -0500, Jonathan Ellis wrote:
 https://issues.apache.org/jira/browse/CASSANDRA-2280
 
 2011/7/19 Héctor Izquierdo Seliva izquie...@strands.com:
  Hi all,
 
  Maybe I'm doing something wrong, but calling ./nodetool -h host repair
  mykeyspace mycolumnfamily should only repair mycolumnfamily right?
  Everytime I try a repair it repairs the whole key space instead of just
  one column family. I'm on cassandra 0.8.1
 
 
 
 
 




Re: Anyone using Facebook's flashcache?

2011-07-18 Thread Héctor Izquierdo Seliva

 
 Hector, some before/after numbers would be great if you can find them.  
 Thanks!
 

I'll try and get some for you :)

 What happens when your cache gets trashed?  Do compactions and flushes 
 go slower?
 

If you use flashcache-wt, flushed and compacted sstables will go to the
cache.

All reads are cached, so if you compact three sstables into one, you are
stuffing your cache with a lot of useless crap and evicting valid blocks
(flashcache won't honor any of the hints set with fadvise, as it's a
block cache layer and doesn't know about them anyway). If your write rate
is low it might work for you.



 aj
 
 
 




Re: Anyone using Facebook's flashcache?

2011-07-18 Thread Héctor Izquierdo Seliva

 Interesting.  So, there is no segregation between read and write cache 
 space?  A compaction or flush can evict blocks in the read cache if it 
 needs the space for write buffering?

There are two versions: the -wt (write-through) version, which also caches
what is written, and the normal version, which only caches reads.
Either way you will pollute your cache with compactions.



Re: Anyone using Facebook's flashcache?

2011-07-18 Thread Héctor Izquierdo Seliva

 
 If using the version that has both rt and wt caches, is it just the wt 
 cache that's polluted for compactions/flushes?  If not, why does the rt 
 cache also get polluted?
 

As I said, all reads go through flashcache, so if you read three 10 GB
sstables for a compaction you will get those 30 GB into the cache. 



Re: Anyone using Facebook's flashcache?

2011-07-18 Thread Héctor Izquierdo Seliva

 Of course.  I wasn't thinking clearly.
 
 So, back to a previous point you brought up, I will have heavy reads and 
 even heavier writes.  How would you rate the benefits of flashcache in 
 such a scenario?  Is it still an overall performance boost worth the 
 expense?

We also have heavy reads and even heavier writes. The max hit ratio I
saw was ~60%, although it was enough to halve the latency, which is a
fairly good result. I guess you will have to try it for yourself. I'm
sorry I cannot give you anything more concrete.

We have since moved to hosting all the data on SSDs, as it's not that
expensive anymore and the results are much, much better. You can get
320 GB drives from OCZ for ~$500, or do a stripe with three 120 GB ones
for ~$700.



Re: Anyone using Facebook's flashcache?

2011-07-17 Thread Héctor Izquierdo Seliva
I've been using flashcache for a while in production. It improves read
performance and cut latency by a good chunk, though I don't remember
the exact numbers.

Problems: compactions will trash your cache, and so will memtable
flushes. Right now there's no way to avoid that.

If you want, I could dig the numbers for a before/after comparison. 




Re: Corrupted data

2011-07-10 Thread Héctor Izquierdo Seliva
All the important stuff is using QUORUM. Normal operation uses around
3-4 GB of heap out of 6. I've also tried running repair on a per-CF
basis, and still no luck. I've found it's faster to bootstrap a node
again than to repair it.

Once I have the cluster in a sane state I'll try running a repair as
part of normal operation and see if it manages to finish.

Btw, we are not using super columns.

Thanks for the tips

On Sat, 09-07-2011 at 17:57 -0700, aaron morton wrote:
  Nope, only when something breaks
 Unless you've been working at QUORUM, life is about to get trickier. Repair
 is an essential part of running a Cassandra cluster; without it you risk data
 loss and dead data coming back to life.
 
 If you have been writing at QUORUM, and so have a reasonable expectation of
 data replication, the normal approach is to happily let scrub skip the rows;
 after scrub has completed, a repair will see the data repaired using one of
 the other replicas. That's probably already happened, as the scrub process
 skipped the rows when writing them out to the new files.
 
 Try to run repair. Try running it on a single CF to start with.
 
 
 Good luck
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 9 Jul 2011, at 16:45, Héctor Izquierdo Seliva wrote:
 
  Hi Peter.
  
  I have a problem with repair, and it's that it always brings the node
  doing the repairs down. I've tried setting index_interval to 5000, and
  it still dies with OutOfMemory errors, or even worse, it generates
  thousands of tiny sstables before dying.
  
  I've tried like 20 repairs during this week. None of them finished. This
  is on a 16GB machine using 12GB heap so it doesn't crash (too early).
  
  
  On Sat, 09-07-2011 at 16:16 +0200, Peter Schuller wrote:
  - Have you been running repair consistently ?
  
  Nop, only when something breaks
  
  This is unrelated to the problem you were asking about, but if you
  never run delete, make sure you are aware of:
  
  http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
  http://wiki.apache.org/cassandra/DistributedDeletes
  
  
  
  
 




Re: node stuck leaving

2011-07-10 Thread Héctor Izquierdo Seliva
I'm also having problems with removetoken. Maybe I'm doing it wrong, but
I was under the impression that I just had to call removetoken once.
When I take a look at the nodes' rings, the dead node keeps popping up.
What's even more incredible is that on some of them it shows as UP.




Re: node stuck leaving

2011-07-10 Thread Héctor Izquierdo Seliva
In the end I had to restart the whole cluster. This is the second time
I've had to do this. Would it be possible to add a command that forces
all nodes to remove all the ring data and start fresh? I'd rather
have a few seconds of errors in the clients than the two to five minutes
that a full cluster restart takes.



Re: Corrupted data

2011-07-09 Thread Héctor Izquierdo Seliva
Hi Peter.

 I have a problem with repair, and it's that it always brings the node
doing the repairs down. I've tried setting index_interval to 5000, and
it still dies with OutOfMemory errors, or even worse, it generates
thousands of tiny sstables before dying.

I've tried like 20 repairs during this week. None of them finished. This
is on a 16GB machine using 12GB heap so it doesn't crash (too early).


On Sat, 09-07-2011 at 16:16 +0200, Peter Schuller wrote:
  - Have you been running repair consistently ?
 
  Nope, only when something breaks
 
 This is unrelated to the problem you were asking about, but if you
 never run delete, make sure you are aware of:
 
 http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
 http://wiki.apache.org/cassandra/DistributedDeletes
 
 




Corrupted data

2011-07-08 Thread Héctor Izquierdo Seliva
Hi everyone,

I'm having thousands of these errors:

 WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705
CompactionManager.java (line 737) Non-fatal error reading row
(stacktrace follows)
java.io.IOError: java.io.IOException: Impossible row size
6292724931198053
at
org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:719)
at
org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:633)
at org.apache.cassandra.db.compaction.CompactionManager.access
$600(CompactionManager.java:65)
at org.apache.cassandra.db.compaction.CompactionManager
$3.call(CompactionManager.java:250)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor
$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor
$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Impossible row size 6292724931198053
... 9 more
 INFO [CompactionExecutor:1] 2011-07-08 16:36:45,705
CompactionManager.java (line 743) Retrying from row index; data is -8
bytes starting at 4735525245
 WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705
CompactionManager.java (line 767) Retry failed too.  Skipping to next
row (retry's stacktrace follows)
java.io.IOError: java.io.EOFException: bloom filter claims to be
863794556 bytes, longer than entire row size -8


This is during scrub; I saw similar errors during normal operation as well.
Is there anything I can do? It looks like I'm going to lose a ton of
data.



Re: Corrupted data

2011-07-08 Thread Héctor Izquierdo Seliva
Hi Aaron,

On Fri, 08-07-2011 at 14:47 -0700, aaron morton wrote:
 You may not lose data. 
 
 - What version and whats the upgrade history?

All versions from 0.7.1 to 0.8.1. All CFs were in 0.8.1 format though.

 - What RF / node count / CL  ?

RF=3, node count = 6
 - Have you been running repair consistently ?

Nope, only when something breaks

 - Is this on a single node or all nodes ?

A couple of nodes. Scrub reported a few thousand columns it could not
restore.
 
 Cheers
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 8 Jul 2011, at 09:38, Héctor Izquierdo Seliva wrote:
 
  Hi everyone,
  
  I'm having thousands of these errors:
  
  WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705
  CompactionManager.java (line 737) Non-fatal error reading row
  (stacktrace follows)
  java.io.IOError: java.io.IOException: Impossible row size
  6292724931198053
  at
  org.apache.cassandra.db.compaction.CompactionManager.scrubOne(CompactionManager.java:719)
  at
  org.apache.cassandra.db.compaction.CompactionManager.doScrub(CompactionManager.java:633)
  at org.apache.cassandra.db.compaction.CompactionManager.access
  $600(CompactionManager.java:65)
  at org.apache.cassandra.db.compaction.CompactionManager
  $3.call(CompactionManager.java:250)
  at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
  at java.util.concurrent.ThreadPoolExecutor
  $Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor
  $Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
  Caused by: java.io.IOException: Impossible row size 6292724931198053
  ... 9 more
  INFO [CompactionExecutor:1] 2011-07-08 16:36:45,705
  CompactionManager.java (line 743) Retrying from row index; data is -8
  bytes starting at 4735525245
  WARN [CompactionExecutor:1] 2011-07-08 16:36:45,705
  CompactionManager.java (line 767) Retry failed too.  Skipping to next
  row (retry's stacktrace follows)
  java.io.IOError: java.io.EOFException: bloom filter claims to be
  863794556 bytes, longer than entire row size -8
  
  
  THis is during scrub, as I saw similar errors while in normal operation.
  Is there anything I can do? It looks like I'm going to lose a ton of
  data
  
 




Re: Cannot recover SSTable with version f (current version g)

2011-07-06 Thread Héctor Izquierdo Seliva
Thanks Sylvain. I'll try option 3 when the current repair ends so I can
fix the remaining CFs.

I'm also finding a few of these while opening sstables that have been
built by repair (SSTable build compactions):

ERROR [CompactionExecutor:2] 2011-07-06 10:09:16,054
AbstractCassandraDaemon.java (line 113) Fatal exception in thread
Thread[CompactionExecutor:2,1,main]
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableWriter
$RowIndexer.close(SSTableWriter.java:382)
at org.apache.cassandra.io.sstable.SSTableWriter
$RowIndexer.index(SSTableWriter.java:370)
at org.apache.cassandra.io.sstable.SSTableWriter
$Builder.build(SSTableWriter.java:315)
at org.apache.cassandra.db.compaction.CompactionManager
$9.call(CompactionManager.java:1103)
at org.apache.cassandra.db.compaction.CompactionManager
$9.call(CompactionManager.java:1094)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor
$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor
$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



On Wed, 06-07-2011 at 11:22 +0200, Sylvain Lebresne wrote:
 2011/7/6 Héctor Izquierdo Seliva izquie...@strands.com:
  Hi, I've been struggling to repair my failed node for the past few days,
  and I've seen this error a few times.
 
  java.lang.RuntimeException: Cannot recover SSTable with version f
  (current version g).
 
  If it can read the sstables, why can't they be used to repair?
 
 After an sstable is streamed (by repair in particular), we first have to
 recreate a number of data structures. To do that, a specific part of the
 code is used that does not know how to cope with older sstable versions
 (hence the message). This does not mean that this won't ever get fixed
 (it will), but it is a current limitation.
 
  Is there anything I can do besides running scrub or compact on the whole
  cluster?
 
 Not really, no. You can wait long enough for all old sstables to have been
 compacted to the newer version before running a repair, but since that could
 take a long time if you have lots of data, that may not always be an option.
 Now the most "efficient" way (the quotes are because it's not the most
 efficient in the time and investment it will require from you) is probably
 something like this. For each of the nodes of the cluster:
   1) Look at the sstables on the node (do an 'ls' on the data directory).
 The sstables will look something like Standard1-g-105-Data.db, where the 'g'
 may be an 'f' for some (all?) of the sstables. 'f' means an old sstable,
 'g' means a new one.
   2) If all or almost all the big sstables are 'f', then run scrub on that
 node; that'll be simpler.
   3) If only a handful of sstables are on 'f' (typically, if the node has
 not just been updated), then what you can do is force a compaction on those
 only by using JMX -> CompactionManager -> forceUserDefinedCompaction,
 giving it the keyspace name as the first argument and the full path to the
 sstable as the second argument.
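
Step 3 can be scripted; the sketch below is a minimal, hedged example of invoking
that MBean operation from Java. The MBean object name, the JMX port, and the sample
sstable path are assumptions, so verify them with jconsole against your own node.
The two string arguments follow the description above: the keyspace name, then the
full path to the old 'f' sstable.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ForceSSTableCompaction {
        public static void main(String[] args) throws Exception {
            // Assumed JMX endpoint; 7199 is the usual port on 0.8, adjust for your node
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Assumed MBean name; confirm it in jconsole for your version
                ObjectName compactionManager =
                        new ObjectName("org.apache.cassandra.db:type=CompactionManager");
                // Arguments as described above: keyspace name, full path to the 'f' sstable
                mbs.invoke(compactionManager, "forceUserDefinedCompaction",
                        new Object[] { "mykeyspace",
                                "/var/lib/cassandra/data/mykeyspace/Standard1-f-105-Data.db" },
                        new String[] { "java.lang.String", "java.lang.String" });
            } finally {
                connector.close();
            }
        }
    }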
 
 
 
 
  Regards
 
  Hector Izquierdo
 
 
 
 




OutOfMemory during repair on 0.8.1

2011-07-06 Thread Héctor Izquierdo Seliva
Hi all,

I don't seem to be able to complete a full repair on one of the nodes.
Memory consumption keeps growing until it starts complaining about not
having enough heap. I had to disable the automatic memtable flush, as it
was generating thousands of almost empty memtables.

My guess is that the key indexes that are kept in memory grow with each
new sstable that the repair generates. On my third attempt, I have 1102
pending tasks (which seem to be almost all SSTable build operations
mixed with a few compactions), and the heap is 80%-85% full.

Is there a setting or tweak I can make? I had to up the heap from 6GB to
10GB in a 12 GB machine. I can't give it any more heap.





Re: OutOfMemory during repair on 0.8.1

2011-07-06 Thread Héctor Izquierdo Seliva
Forcing a full GC doesn't help either. Now the node is stuck in an
endless loop of full GCs that don't free any memory.



Re: Repair doesn't work after upgrading to 0.8.1

2011-07-05 Thread Héctor Izquierdo Seliva
Hi All, sorry for taking so long to answer. I was away from the
internet.

 Héctor, when you say "I have upgraded all my cluster to 0.8.1", from
  which version was that: 0.7.something or 0.8.0?

0.7.6-2 to 0.8.1

 This is the same behavior I reported in 2768 as Aaron referenced ...
  What was suggested for us was to do the following:
 
  - Shut down the entire ring
  - When you bring up each node, do a nodetool repair
 

That's exactly what I ended up doing. Repair now works. I tried to do a
rolling restart with 2818 applied, but it did not work.

 However, in the issue reported, it was unable to be reproduced ... I'd
  be curious to know how Hector's keyspace is defined.  Ours at the time
  was RF=3 and using Ec2 snitch...

Nothing special. Default snitch, RF=3.

I think this should be prioritized, as having to restart the whole
cluster is a bit extreme. We don't have separate DCs, so I had to
incur downtime, which costs money, and a little bit of grief.


On Fri, 01-07-2011 at 10:16 +0200, Sylvain Lebresne wrote:
 To make it clear what the problem is, this is not a repair problem. This is
 a gossip problem. Gossip is reporting that the remote node is a 0.7 node,
 and repair is just saying "I cannot use that node because repair has changed
 and the 0.7 node will not know how to answer me correctly", which is the
 correct behavior if the node happens to be a 0.7 node.
 
 Hence, I'm kind of baffled that dropping a keyspace and recreating it fixed
 anything. Unless, as part of removing the keyspace, you've deleted the
 system tables, in which case that could have triggered something.
 
 --
 Sylvain
 
 On Fri, Jul 1, 2011 at 9:33 AM, Sasha Dolgy sdo...@gmail.com wrote:
  This is the same behavior I reported in 2768 as Aaron referenced ...
  What was suggested for us was to do the following:
 
  - Shut down the entire ring
  - When you bring up each node, do a nodetool repair
 
  That didn't immediately resolve the problems.  In the end, I backed up
  all the data, removed the keyspace and created a new one.  That seemed
  to have solved our problems.  That was from 0.7.6-2 to 0.8.0
 
  However, in the issue reported, it was unable to be reproduced ... I'd
  be curious to know how Hector's keyspace is defined.  Ours at the time
  was RF=3 and using Ec2 snitch...
 
  -sd
 
  On Fri, Jul 1, 2011 at 9:22 AM, Sylvain Lebresne sylv...@datastax.com 
  wrote:
  Héctor, when you say I have upgraded all my cluster to 0.8.1, from
  which version was
  that: 0.7.something or 0.8.0 ?
 
  If this was 0.8.0, did you run successful repair on 0.8.0 previous to
  the upgrade ?
 




Repair doesn't work after upgrading to 0.8.1

2011-06-30 Thread Héctor Izquierdo Seliva
Hi all,

I have upgraded my whole cluster to 0.8.1. Today one of the disks in one
of the nodes died. After replacing the disk I tried running repair, but
this message appears:

 INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.80
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.77
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
20:36:25,085 AntiEntropyService.java (line 179) Excluding /10.20.13.76
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098] 2011-06-30
20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair
with for sbs on
(170141183460469231731687303715884105727,28356863910078205288614550619314017621]:
 manual-repair-26f5a7dd-cf12-44de-9f8f-6b6335bdd098 completed.
 INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
20:36:25,086 AntiEntropyService.java (line 179) Excluding /10.20.13.79
from repair because it is on version 0.7 or sooner. You should consider
updating this node before running repair again.
 INFO [manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf] 2011-06-30
20:36:25,086 AntiEntropyService.java (line 782) No neighbors to repair
with for sbs on
(141784319550391026443072753096570088105,170141183460469231731687303715884105727]:
 manual-repair-bdb4055a-d370-4d2a-a1dd-70a7e4fa60cf completed.
 INFO [manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a] 2011-06-30
20:36:25,087 AntiEntropyService.java (line 782) No neighbors to repair
with for sbs on
(113427455640312821154458202477256070484,141784319550391026443072753096570088105]:
 manual-repair-2a11d01c-e1e4-4f1e-b8cd-00a9a3fd2f4a completed.

What can I do?



Re: insufficient space to compact even the two smallest files, aborting

2011-06-23 Thread Héctor Izquierdo Seliva
Hi Aaron. Reverted back to 4-32. Did the flush but it did not trigger
any minor compaction. Ran compact by hand, and it picked only two
sstables.

Here's the ls before:

http://pastebin.com/xDtvVZvA

And this is the ls after:

http://pastebin.com/DcpbGvK6

Any suggestions?



On Thu, 23-06-2011 at 10:55 +1200, aaron morton wrote:
 Setting them to 2 and 2 means compaction can only ever compact 2 files at a
 time, so it will be worse off.
 
 Let's try the following:
 
 - restore the compaction settings to the default 4 and 32
 - run `ls -lah` in the data dir and grab the output
 - run `nodetool flush`; this will trigger minor compaction once the memtables
 have been flushed
 - check the logs for messages from 'CompactionManager'
 - when done, grab the output from `ls -lah` again.
 
 Hope that helps. 
 
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 23 Jun 2011, at 02:04, Héctor Izquierdo Seliva wrote:
 
  Hi All. I set the compaction threshold at minimum 2, maximum 2 and try
  to run compact, but it's not doing anything. There are over 69 sstables
  now, read performance is horrible, and it's taking an insane amount of
  space. Maybe I don't quite get how the new per bucket stuff works, but I
  think this is not normal behaviour.
  
  On Mon, 13-06-2011 at 10:32 -0500, Jonathan Ellis wrote:
  As Terje already said in this thread, the threshold is per bucket
  (group of similarly sized sstables) not per CF.
  
  2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com:
  I was already way over the minimum. There were 12 sstables. Also, is
  there any reason why scrub got stuck? I did not see anything in the
  logs. Via jmx I saw that the scrubbed bytes were equal to one of the
  sstables size, and it stuck there for a couple hours .
  
   On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote:
  That most likely happened just because after scrub you had new files
  and got over the 4 file minimum limit.
  
  https://issues.apache.org/jira/browse/CASSANDRA-2697
  
  Is the bug report.
  
  
  
  
  
  
  
  
  
  
 




Re: insufficient space to compact even the two smallest files, aborting

2011-06-22 Thread Héctor Izquierdo Seliva
Hi All. I set the compaction threshold to minimum 2, maximum 2 and tried
to run compact, but it's not doing anything. There are over 69 sstables
now, read performance is horrible, and it's taking an insane amount of
space. Maybe I don't quite get how the new per-bucket stuff works, but I
think this is not normal behaviour.

On Mon, 13-06-2011 at 10:32 -0500, Jonathan Ellis wrote:
 As Terje already said in this thread, the threshold is per bucket
 (group of similarly sized sstables) not per CF.
 
 2011/6/13 Héctor Izquierdo Seliva izquie...@strands.com:
  I was already way over the minimum. There were 12 sstables. Also, is
  there any reason why scrub got stuck? I did not see anything in the
  logs. Via jmx I saw that the scrubbed bytes were equal to one of the
  sstables size, and it stuck there for a couple hours .
 
  On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote:
  That most likely happened just because after scrub you had new files
  and got over the 4 file minimum limit.
 
  https://issues.apache.org/jira/browse/CASSANDRA-2697
 
  Is the bug report.
 
 
 
 
 
 
 
 




Re: Cassandra Statistics and Metrics

2011-06-16 Thread Héctor Izquierdo Seliva
This is what I use:

http://code.google.com/p/simple-cassandra-monitoring/

Disclaimer: I did it myself, don't expect too much :P

On Thu, 16-06-2011 at 19:35 +0300, Viktor Jevdokimov wrote:
 There's a possibility to use a command-line JMX client with the standard
 Zabbix agent to request JMX counters without incorporating zapcat into
 Cassandra or another Java app.
 I'm investigating this right now and will post results when
 finished.
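
As a rough illustration of that approach (independent of Zabbix), a minimal Java
JMX poller might look like the sketch below. The host and port are assumptions;
the attribute read here is the standard JVM HeapMemoryUsage, since the
Cassandra-specific MBean names vary by version and are easiest to discover by
browsing the org.apache.cassandra.* domains in jconsole.

    import java.lang.management.MemoryUsage;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.openmbean.CompositeData;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxPoller {
        public static void main(String[] args) throws Exception {
            // Assumed host/port; point this at the Cassandra node's JMX endpoint
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Standard JVM MBean; swap in a Cassandra ObjectName for column family stats
                ObjectName memory = new ObjectName("java.lang:type=Memory");
                CompositeData heap = (CompositeData) mbs.getAttribute(memory, "HeapMemoryUsage");
                MemoryUsage usage = MemoryUsage.from(heap);
                System.out.println("heap used bytes: " + usage.getUsed());
            } finally {
                connector.close();
            }
        }
    }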
 
 2011/6/15 Viktor Jevdokimov vjevdoki...@gmail.com
 http://www.kjkoster.org/zapcat/Zapcat_JMX_Zabbix_Bridge.html
 
 
 
 2011/6/14 Marcos Ortiz mlor...@uci.cu
 Where I can find the source code?
 
 On 6/14/2011 10:13 AM, Viktor Jevdokimov wrote: 
 
  We're using open source monitoring solution Zabbix
  from http://www.zabbix.com/ using zapcat - not only
  for Cassandra but for the whole system. 
  
  
  As MX4J tools plugin is supported by Cassandra,
  support of zapcat in Cassandra by default is welcome
  - we have to use a wrapper to start zapcat agent.
  
  2011/6/14 Marcos Ortiz mlor...@uci.cu
  Regards to all.
  My team and me here on the University are
  working on a generic solution for Monitoring
  and Capacity Planning for Open Sources
  Databases, and one of the NoSQL db that we
  choosed to give it support is Cassandra.
  Where I can find all the metrics and
  statistics of Cassandra? I'm thinking for
  example:
  - Available space
  - Number of CF
  and all kind of metrics
  
  We are using for this development: Python +
  Django + Twisted + Orbited + jQuery. The
  idea behind is to build a Comet-based web
  application on top of these technologies.
  Any advice is welcome
  
  -- 
  Marcos Luís Ortíz Valmaseda
   Software Engineer (UCI)
   http://marcosluis2186.posterous.com
   http://twitter.com/marcosluis2186
   
  
  
 
 -- 
 Marcos Luís Ortíz Valmaseda
  Software Engineer (UCI)
  http://marcosluis2186.posterous.com
  http://twitter.com/marcosluis2186
   
 
 
 
 




Re: insufficient space to compact even the two smallest files, aborting

2011-06-13 Thread Héctor Izquierdo Seliva
Hi All. I found a way to be able to compact. I have to call scrub on
the column family. Then scrub gets stuck forever. I restart the node,
and voila! I can compact again without any message about not having
enough space. This looks like a bug to me. What info would be needed to
file a report? This is on 0.8, upgrading from 0.7.5.




Re: insufficient space to compact even the two smallest files, aborting

2011-06-13 Thread Héctor Izquierdo Seliva
I was already way over the minimum. There were 12 sstables. Also, is
there any reason why scrub got stuck? I did not see anything in the
logs. Via JMX I saw that the scrubbed bytes were equal to one of the
sstables' sizes, and it stuck there for a couple of hours.

On Mon, 13-06-2011 at 22:55 +0900, Terje Marthinussen wrote:
 That most likely happened just because after scrub you had new files
 and got over the 4 file minimum limit.
 
 https://issues.apache.org/jira/browse/CASSANDRA-2697
 
 Is the bug report.
 





Re: Retrieving a column from a fat row vs retrieving a single row

2011-06-10 Thread Héctor Izquierdo Seliva
I think I will follow the advice of better balancing and I will split
the index into several pieces. Thanks everybody for your input!




Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Héctor Izquierdo Seliva
Hi Terje,

There are 12 SSTables, so I don't think that's the problem. I will try
anyway and see what happens.

On Fri, 10-06-2011 at 20:21 +0900, Terje Marthinussen wrote:

 bug in the 0.8.0 release version.
 
 
 
 Cassandra splits the sstables depending on size and tries to find (by
 default) at least 4 files of similar size.
 
 
 If it cannot find 4 files of similar size, it logs that message in
 0.8.0.
 
 
 You can try to reduce the minimum required  files for compaction and
 it will work.
 





Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Héctor Izquierdo Seliva


On Fri, 10-06-2011 at 20:21 +0900, Terje Marthinussen wrote:
 bug in the 0.8.0 release version.
 
 
 Cassandra splits the sstables depending on size and tries to find (by
 default) at least 4 files of similar size.
 
 
 If it cannot find 4 files of similar size, it logs that message in
 0.8.0.
 
 
 You can try to reduce the minimum required  files for compaction and
 it will work.
 
 
 Terje


Hi Terje,

There are 12 SSTables, so I don't think that's the problem. I will try
anyway and see what happens.




Re: insufficient space to compact even the two smallest files, aborting

2011-06-10 Thread Héctor Izquierdo Seliva
On Fri, 10-06-2011 at 23:40 +0900, Terje Marthinussen wrote:
 Yes, which is perfectly fine for a short time if all you want is to
 compact to one file for some reason.
 
 
 I run min_compaction_threshold = 2 on one system here with SSD. No
 problems with the more aggressive disk utilization on the SSDs from
 the extra compactions, reducing disk space is much more important.
 
 
 Note that this is a threshold per bucket of similar sized sstables. Not
 the total number of sstables, so a threshold of 2 will not give you one
 big file.
 
 
 Terje

Cassandra refuses to do a major compaction no matter what I do. There
are 110GB free, and all the sstables I want to compact amount to 15GB,
and the same message keeps popping up.




Re: Retrieving a column from a fat row vs retrieving a single row

2011-06-09 Thread Héctor Izquierdo Seliva
On Thu, 09-06-2011 at 13:28 +0200, Richard Low wrote:
 Remember also that partitioning is done by rows, not columns.  So
 large rows are stored on a single host.  This means they can't be load
 balanced and also all requests to that row will hit one host.  Having
 separate rows will allow load balancing of I/Os.
 

Yeah, but if I have RF=3 then there are three nodes that can answer the
request right?



Re: Data directories

2011-06-09 Thread Héctor Izquierdo Seliva
I'm actually using it on a couple of nodes, but it is slower than directly
accessing the data on an SSD.

On Thu, 09-06-2011 at 11:10 -0400, Chris Burroughs wrote:
 On 06/08/2011 05:54 AM, Héctor Izquierdo Seliva wrote:
  Is there a way to control what sstables go to what data directory? I
  have a fast but space limited ssd, and a way slower raid, and i'd like
  to put latency sensitive data into the ssd and leave the other data in
  the raid. Is this possible? If not, how well does cassandra play with
  symlinks?
  
 
 Another option would be to use the ssd as a block level cache with
 something like flashcache https://github.com/facebook/flashcache/.




Data directories

2011-06-08 Thread Héctor Izquierdo Seliva

Hi,

Is there a way to control which sstables go to which data directory? I
have a fast but space-limited SSD and a much slower RAID, and I'd like
to put latency-sensitive data on the SSD and leave the other data on
the RAID. Is this possible? If not, how well does Cassandra play with
symlinks?



Re: Data directories

2011-06-08 Thread Héctor Izquierdo Seliva
On Wed, 08-06-2011 at 08:42 -0500, Jonathan Ellis wrote:
 No. https://issues.apache.org/jira/browse/CASSANDRA-2749 is open to
 track this but nobody is working on it to my knowledge.
 
 Cassandra is fine with symlinks at the data directory level but I
 don't think that helps you, since you really want to move the sstables
 themselves. (Cassandra is NOT fine with symlinked sstable files, or
 with any moving around of sstable files while it is running.)

I was planning on creating another keyspace and moving the slow sstables
there. Of course, everything would be done while the node is stopped.

Thanks for your help



Retrieving a column from a fat row vs retrieving a single row

2011-06-08 Thread Héctor Izquierdo Seliva
Hi,

I have an index I use to translate ids. I usually read only one column at
a time, and it's becoming a bottleneck. I could rewrite the application
to read a bunch at a time, but it would make the application logic much
harder, as it would involve buffering incoming data.

As far as I know, to read a single column Cassandra will deserialize a
bunch of them and then pick the correct one (64 KB of data, right?).

Would it be faster to have a row for each id I want to translate? This
would make the key cache less effective, but the amount of data read should
be smaller.

Thanks!





Concurrent Mark Sweep taking 12 seconds

2011-05-16 Thread Héctor Izquierdo Seliva
Hi everyone. I see in the logs that Concurrent Mark Sweep is taking 12
seconds to do its stuff. Is this normal? There is no stop-the-world GC,
it just takes 12 seconds.

Configuration: 0.7.5, 8 GB heap, 16 GB machines. 7 x 64 MB memtables.



Re: Index interval tuning

2011-05-11 Thread Héctor Izquierdo Seliva
On Wed, 11-05-2011 at 14:24 +1200, aaron morton wrote:
 What version and what were the values for RecentBloomFilterFalsePositives and 
 BloomFilterFalsePositives ?
 
 The bloom filter metrics are updated in SSTableReader.getPosition(). The only
 slightly odd thing I can see is that we do not count a key cache hit as a true
 positive for the bloom filter. If there were a lot of key cache hits and a
 few false positives the ratio would be wrong. I'll ask around; it does not seem
 to apply to Hector's case though.
 
 Cheers

0.7.5, and I am no longer using the key cache. I get the bloom filter stats
via JMX. BloomFilterFalsePositiveRatio is always stuck at 1.0.
RecentBloomFilterFalsePositiveRatio fluctuates from 0 to 1.0 with no
intermediate values.

As for the index interval settings, I changed it from 128 to 256 and
memory consumption was just a tad lower but read performance was worse
by a few ms, so not much to gain there.

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11 May 2011, at 10:38, Chris Burroughs wrote:
 
  On 05/10/2011 02:12 PM, Peter Schuller wrote:
  That reminds me, my false positive ration is stuck at 1.0, so I guess
  bloom filters aren't doing a lot for me.
  
  That sounds unlikely unless you're hitting some edge case like reading
  a particular row that happened to be a collision, and only that row.
  This is from JMX stats on the column family store?
  
  
  (From jmx)  I also see BloomFilterFalseRatio stuck at 1.0 on my
  production nodes.  The only values that RecentBloomFilterFalseRatio had
  over the past several minutes were 0.0 and 1.0.  While I can't prove
  that isn't accurate, it is very suspicions.
  
  The code looked reasonable until I got to SSTableReader, which was too
  complicated to just glance through.
 




Re: Index interval tuning

2011-05-11 Thread Héctor Izquierdo Seliva
Sorry Aaron, here are the values you requested:

RecentBloomFilterFalsePositives = 5;
BloomFilterFalsePositives = 385260;

Uptime of the node is three and a half days, more or less.


On Wed, 11-05-2011 at 22:05 +1200, aaron morton wrote:

 What are the values for RecentBloomFilterFalsePositives and
 BloomFilterFalsePositives, the non-ratio ones?
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 11 May 2011, at 19:53, Héctor Izquierdo Seliva wrote:
 
  On Wed, 11-05-2011 at 14:24 +1200, aaron morton wrote:
  What version and what were the values for RecentBloomFilterFalsePositives 
  and BloomFilterFalsePositives ?
  
  The bloom filter metrics are updated in SSTableReader.getPosition() the 
  only slightly odd thing I can see is that we do not count a key cache hit 
  a a true positive for the bloom filter. If there were a lot of key cache 
  hits and a few false positives the ratio would be wrong. I'll ask around, 
  does not seem to apply to Hectors case though. 
  
  Cheers
  
  0.7.5, and I am no longer using key cache. I get the bloom filter stats
  via jmx. BloomFilterFalsePositiveRatio is always stuck at 1.0.
  RecentBloomFilterFalsePositiveRation fluctuates from 0 to 1.0 with no
  intermediate values. 
  
  As for the index interval settings, I changed it from 128 to 256 and
  memory consumption was just a tad lower but read performance was worse
  by a few ms, so not much to gain there.
  
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
  
  On 11 May 2011, at 10:38, Chris Burroughs wrote:
  
  On 05/10/2011 02:12 PM, Peter Schuller wrote:
  That reminds me, my false positive ration is stuck at 1.0, so I guess
  bloom filters aren't doing a lot for me.
  
  That sounds unlikely unless you're hitting some edge case like reading
  a particular row that happened to be a collision, and only that row.
  This is from JMX stats on the column family store?
  
  
  (From jmx)  I also see BloomFilterFalseRatio stuck at 1.0 on my
  production nodes.  The only values that RecentBloomFilterFalseRatio had
  over the past several minutes were 0.0 and 1.0.  While I can't prove
  that isn't accurate, it is very suspicions.
  
  The code looked reasonable until I got to SSTableReader, which was too
  complicated to just glance through.
  
  
  
 




Index interval tuning

2011-05-09 Thread Héctor Izquierdo Seliva
Hi everyone.

I have a few sstables with around 500 million keys, and memory usage has
grown a lot, I suppose because of the indexes. These sstables are
comprised of skinny rows, but a lot of them. Would tuning index_interval
make the memory usage go down? And what would the performance hit be?

I had to up the heap from 5 GB to 8 GB and tune memtable thresholds way lower
than what I was using with less data.

I'm running 0.7.5 in a 6-machine cluster with RF=3. HW is quad-core
Intel machines with 16 GB RAM and md RAID 0 on three SATA disks.

Thanks all for your time!




Re: Index interval tuning

2011-05-09 Thread Héctor Izquierdo Seliva
On Mon, 09-05-2011 at 17:58 +0200, Peter Schuller wrote:
  I have a few sstables with around 500 million keys, and memory usage has
  grown a lot, I suppose because of the indexes. This sstables are
  comprised of skinny rows, but a lot of them. Would tuning index interval
  make the memory usage go down? And what would the performance hit be?
 
 Assuming no row caching, and assuming you're talking about heap usage
 and not the virtual size of the process in top, the primary two things
 that will grow with row count are (1) bloom filters for sstables and
 (2) the sampled index keys. Bloom filters are of a certain size to
 achieve a sufficiently small false positive rate. That target rate
 could be increased to allow smaller bloom filters, but that is not
 exposed as a configuration option and would require code changes.
 

No row cache and no key cache. I've tried with both, but the keys being
read are constantly changing, and I didn't see hit ratios beyond 0.8 %.

That reminds me, my false positive ratio is stuck at 1.0, so I guess
bloom filters aren't doing a lot for me.

 For key sampling, the primary performance penalty should be CPU and
 maybe some disk. On average, when looking up a key in an sstable index
 file, you'll read sample interval/2 entries and deserialize them
 before finding the one you're after. Increasing the sampling interval will
 thus increase the amount of deserialization taking place, as well as
 make the average range of data span additional pages on disk. The
 impact on disk is difficult to judge and likely depends a lot on I/O
 scheduling and other details.
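
To put rough numbers on that trade-off, using the 500 million keys per node
mentioned earlier in the thread (a back-of-envelope sketch, not a measurement):

    public class IndexIntervalMath {
        public static void main(String[] args) {
            long keys = 500000000L;           // keys per node, per the thread above
            int[] intervals = { 128, 256 };   // default index_interval vs the value being tried
            for (int interval : intervals) {
                long sampled = keys / interval;  // index entries kept on the heap
                int avgScan = interval / 2;      // entries read and deserialized per index lookup
                System.out.println("interval " + interval + ": ~" + sampled
                        + " sampled entries in memory, ~" + avgScan + " entries scanned per lookup");
            }
        }
    }

So doubling the interval from 128 to 256 drops the sampled entries from roughly
3.9 million to roughly 2 million per 500 million keys, at the cost of scanning
about 128 instead of 64 index entries per lookup, which is consistent with the
small memory gain and small read penalty reported above.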
 

So the only thing I can do is test it and see how it goes. To make the
change effective, should I do anything beyond changing the value in
cassandra.yaml and restarting the node? I'll try first with 256 and see
what happens.



Re: Problems recovering a dead node

2011-05-04 Thread Héctor Izquierdo Seliva
I'm sorry but I can't provide more detailed info, as I have restarted the
node. After that the number of pending tasks started at 40 and rapidly
went down as compactions finished. The ring now looks OK, with
all the nodes having about the same amount of data. There were no errors
in the node log, only INFO messages.

Should I run repair again just in case? The next time I have to recover
a node, is there a safer/faster way of doing it?

My guess about the number of pending tasks and sstables is that during
repair, it seemed to ask for ranges of the same column family a lot of
times, thus yielding a lot of tiny sstables. This caused minor
compactions to pile up. I have read about never ending repairs, or
repairs done a lot of times over small amounts of data on the mailing
lists. Could this be what happened?


Thanks!


On Wed, 04-05-2011 at 21:02 +1200, aaron morton wrote:
 Certainly sounds a bit sick. 
 
 The first error looks like it happens when the index file points to the wrong
 place in the data file for the SSTable. The second one happens when the index
 file is corrupted. These should be problems nodetool scrub can fix.
 
 The disk space may be dead space to cassandra compaction or some other 
 streaming failure. You can check how much it considers to be live (in use) 
 space using nodetool cfstats. This will also tell you how many sstables are 
 live. Having a lot of dead SSTables is not necessarily a bad thing. 
 
 What are the pending tasks ? what is nodetool tpstats showing ? And what does 
 nodetool ring show from one of the other nodes ? 
 
 I'm assuming there are no errors in the logs on the node. What are the most 
 recent INFO messages?
 
 Hope that helps. 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 4 May 2011, at 17:54, Héctor Izquierdo Seliva wrote:
 
  
  Hi Aaron
  
  It has no data files whatsoever. The upgrade path is 0.7.4 - 0.7.5. It
  turns out the initial problem was the sw raid failing silently because
  of another faulty disk.
  
  Now that the storage is working, I brought up the node again, same IP,
  same token and tried doing nodetool repair. 
  
  All adjacent nodes have finished the streaming session, and now the node
  has a total of 248 GB of data. Is this normal when the load per node is
  about 18GB? 
  
  Also there are 1245 pending tasks. It's been compacting or rebuilding
  sstables for the last 8 hours non stop. There are 2057 sstables in the
  data folder.
  
  Should I have done thing differently or is this the normal behaviour?
  
  Thanks!
  
   On Wed, 04-05-2011 at 07:54 +1200, aaron morton wrote:
  When you say it's clean does that mean the node has no data files ?
  
  After you replaced the disk what process did you use to recover  ?
  
  Also what version are you running and what's the recent upgrade history ?
  
  Cheers
  Aaron
  
  On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
  
  Hi everyone. One of the nodes in my 6 node cluster died with disk
  failures. I have replaced the disks, and it's clean. It has the same
  configuration (same ip, same token).
  
  When I try to restart the node it starts to throw mmap underflow
  exceptions till it closes again.
  
  I tried setting io to standard, but it still fails. It gives errors
  about two decorated keys being different, and the EOFException.
  
  Here is an excerpt of the log
  
  http://pastebin.com/ZXW1wY6T
  
  I can provide more info if needed. I'm at a loss here so any help is
  appreciated.
  
  Thanks all for your time
  
  Héctor Izquierdo
  
  
  
  
 




Re: Problems recovering a dead node

2011-05-04 Thread Héctor Izquierdo Seliva

 
  On Wed, 04-05-2011 at 21:02 +1200, aaron morton wrote:
  Certainly sounds a bit sick. 
  
  The first error looks like it happens when the index file points to the 
  wrong place in the data file for the SSTable. The second one happens when 
  the index file is corrupted. The should be problems nodetool scrub can fix.
  
  The disk space may be dead space to cassandra compaction or some other 
  streaming failure. You can check how much it considers to be live (in use) 
  space using nodetool cfstats. This will also tell you how many sstables are 
  live. Having a lot of dead SSTables is not necessarily a bad thing. 
  
  What are the pending tasks ? what is nodetool tpstats showing ? And what 
  does nodetool ring show from one of the other nodes ? 
  
  I'm assuming there are no errors in the logs on the node. What are the most 
  recent INFO messages?
  
  Hope that helps. 
  


In the end I have had to run repair again, as I was getting old data
back. It seems I'm having the same problem again. Here is my cfstats:

http://pastebin.com/B9eD3b4R

I have 796 sstables for a total of 108GB (and counting) on my data
folder. Almost all of them come from streaming. 694 pending operations.

Here is my ring info:


10.20.13.75  Up  Normal  16.99 GB  16.67%  28356863910078205288614550619314017621
10.20.13.76  Up  Normal  26.76 GB  16.67%  56713727820156410577229101238628035242
10.20.13.77  Up  Normal  28.23 GB  16.67%  85070591730234615865843651857942052863
10.20.13.78  Up  Normal  29.19 GB  16.67%  113427455640312821154458202477256070484
10.20.13.79  Up  Normal  27.71 GB  16.67%  141784319550391026443072753096570088105
10.20.13.80  Up  Normal  25.36 GB  16.67%  170141183460469231731687303715884105727

And here is the output of tpstats:

Pool NameActive   Pending  Completed
ReadStage 0 06943016
RequestResponseStage  0 0   15011243
MutationStage 0 0 964296
ReadRepairStage   0 04064197
GossipStage   0 0  59499
AntiEntropyStage  0 0 77
MigrationStage0 0  0
MemtablePostFlusher   0 0 14
StreamStage   0 0  0
FlushWriter   0 0 14
FILEUTILS-DELETE-POOL 0 0  2
MiscStage 0 0 83
FlushSorter   0 0  0
InternalResponseStage 0 0  0
HintedHandoff 0 0  6



Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva
Hi everyone. One of the nodes in my 6-node cluster died with disk
failures. I have replaced the disks, and it's clean. It has the same
configuration (same IP, same token).

When I try to restart the node it starts to throw mmap underflow
exceptions until it shuts down again.

I tried setting I/O to standard, but it still fails. It gives errors
about two decorated keys being different, and the EOFException.

Here is an excerpt of the log

http://pastebin.com/ZXW1wY6T

I can provide more info if needed. I'm at a loss here so any help is
appreciated.

Thanks all for your time

Héctor Izquierdo



Re: Problems recovering a dead node

2011-05-03 Thread Héctor Izquierdo Seliva

Hi Aaron

It has no data files whatsoever. The upgrade path is 0.7.4 to 0.7.5. It
turns out the initial problem was the software RAID failing silently because
of another faulty disk.

Now that the storage is working, I brought up the node again, same IP,
same token and tried doing nodetool repair. 

All adjacent nodes have finished the streaming session, and now the node
has a total of 248 GB of data. Is this normal when the load per node is
about 18GB? 

Also there are 1245 pending tasks. It's been compacting or rebuilding
sstables for the last 8 hours non stop. There are 2057 sstables in the
data folder.

Should I have done things differently or is this the normal behaviour?

Thanks!

On Wed, 04-05-2011 at 07:54 +1200, aaron morton wrote:
 When you say it's clean, does that mean the node has no data files?
 
 After you replaced the disk, what process did you use to recover?
 
 Also, what version are you running and what's the recent upgrade history?
 
 Cheers
 Aaron
 
 On 3 May 2011, at 23:09, Héctor Izquierdo Seliva wrote:
 
  Hi everyone. One of the nodes in my 6 node cluster died with disk
  failures. I have replaced the disks, and it's clean. It has the same
  configuration (same ip, same token).
  
  When I try to restart the node it starts to throw mmap underflow
  exceptions till it closes again.
  
  I tried setting io to standard, but it still fails. It gives errors
  about two decorated keys being different, and the EOFException.
  
  Here is an excerpt of the log
  
  http://pastebin.com/ZXW1wY6T
  
  I can provide more info if needed. I'm at a loss here so any help is
  appreciated.
  
  Thanks all for your time
  
  Héctor Izquierdo
  
 




Re: Tombstones and memtable_operations

2011-04-20 Thread Héctor Izquierdo Seliva
On Wed, 20-04-2011 at 23:00 +1200, aaron morton wrote:
 Looks like a bug, I've added a patch
 here https://issues.apache.org/jira/browse/CASSANDRA-2519
 
 
 Aaron
 

That was fast! Thanks Aaron




Re: How to warm up a cold node

2011-04-19 Thread Héctor Izquierdo Seliva
Shouldn't the dynamic snitch take into account response times and ask a
slow node for fewer requests? It seems that at node startup, only a
handful of requests arrive at the node and it keeps up well, but
there's a moment where there's more than it can handle with a cold cache
and it starts dropping messages like crazy.

Could it be that the transition from slow node to OK node is too steep?


On Fri, 15-04-2011 at 16:19 +0200, Peter Schuller wrote:
  Hi everyone, is there any recommended procedure to warm up a node before
  bringing it up?
 
 Currently the only out-of-the-box support for warming up caches is
 that implied by the key cache and row cache, which will pre-heat on
 start-up. Indexes will be indirectly preheated by index sampling, to
 the extent that the operating system retains them in page cache.
 
 If you're wanting to pre-heat sstables there's currently no way to do
 that (but it's a useful feature to have). Pragmatically, you can
 script something that e.g. does cat path/to/keyspace/* > /dev/null
 or similar. But that only works if the total database size fits
 reasonably well in page cache.
 
 Pre-heating sstables on a per-cf basis on start-up would be a nice
 feature to have.
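
A rough Java equivalent of that cat trick, for anyone who prefers to script it
(the data directory path is a placeholder, and as noted above it only helps if
the data fits reasonably well in page cache):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class WarmSSTables {
        public static void main(String[] args) throws IOException {
            // Placeholder path: point it at the keyspace data directory you want warmed
            File dataDir = new File("/var/lib/cassandra/data/mykeyspace");
            File[] files = dataDir.listFiles();
            if (files == null) return;
            byte[] buf = new byte[1 << 20];
            for (File f : files) {
                if (!f.getName().endsWith("-Data.db")) continue;
                FileInputStream in = new FileInputStream(f);
                try {
                    // Read and discard; the goal is only to pull the sstable into the OS page cache
                    while (in.read(buf) != -1) { }
                } finally {
                    in.close();
                }
            }
        }
    }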
 




Tombstones and memtable_operations

2011-04-19 Thread Héctor Izquierdo Seliva
Hi everyone. I've configured memtable_operations = 0.02 on one of my column
families and started deleting keys. I have already deleted 54k, but there
hasn't been any flush of the memtable. Memory keeps piling up and eventually
nodes start to do stop-the-world GCs. Is this the way this is supposed to
work or have I done something wrong?

Thanks!



Re: Tombstones and memtable_operations

2011-04-19 Thread Héctor Izquierdo Seliva
OK, I've read about gc_grace_seconds, but I'm not sure I understand it
fully. Until gc_grace_seconds have passed and there is a compaction,
the tombstones live in memory? I have to delete 100 million rows and my
insert rate is very low, so I don't have a lot of compactions. What
should I do in this case? Lower the major compaction threshold and
memtable_operations to some very low number?

Thanks

On Tue, 19-04-2011 at 17:36 +0200, Héctor Izquierdo Seliva wrote:
 Hi everyone. I've configured in one of my column families
 memtable_operations = 0.02 and started deleting keys. I have already
 deleted 54k, but there hasn't been any flush of the memtable. Memory
 keeps pilling up and eventually nodes start to do stop-the-world GCs. Is
 this the way this is supposed to work or have I done something wrong?
 
 Thanks!
 




Re: How to warm up a cold node

2011-04-19 Thread Héctor Izquierdo Seliva
On Wed, 20-04-2011 at 07:59 +1200, aaron morton wrote:
 The dynamic snitch only reduces the chance that a node is used in a read
 operation; it depends on the RF, the CL for the operation, the partitioner
 and possibly the network topology. Dropping read messages is OK, so long as
 your operation completes at the requested CL.
 
 Are you using either a key_cache or a row_cache? If so, have you enabled the
 background save for those as part of the column definition?
 
 If this is about getting the OS caches warmed up, you should see the pending
 count on the READ stage back up and the I/O stats start to show longer
 queues. In that case Peter's suggestion below is probably the best option.
 
 Another hacky way to warm things would be to use get_count() on rows the app
 thinks it may need to use. This will cause all the columns to be read from
 disk, but not sent over the network.
 
 Aaron

The rows that need to be read change over time and they are only read
for short amounts of time, so it's a bit tricky. I'll try to figure out
a way to use your suggestions.

In another thread I asked if it would be feasible to somehow store which
parts of the sstables are hot on shutdown and re-read those parts of the
files on startup. Could it be done in a similar way to the work that's
being done on page migrations? What do you think?

Thanks for your time!


   
 On 20 Apr 2011, at 00:41, Héctor Izquierdo Seliva wrote:
 
  Shouldn't the dynamic snitch take into account response times and ask a
  slow node for less requests? It seems that at node startup, only a
  handfull of requests arrive to the node and it keeps up well, but
  there's moment where there's more than it can handle with a cold cache
  and starts droping messages like crazy.
  
  Could it be the transition from slow node to ok node is to steep?
  
  
  On Fri, 15-04-2011 at 16:19 +0200, Peter Schuller wrote:
  Hi everyone, is there any recommended procedure to warm up a node before
  bringing it up?
  
  Currently the only out-of-the-box support for warming up caches is
  that implied by the key cache and row cache, which will pre-heat on
  start-up. Indexes will be indirectly preheated by index sampling, to
  the extent that they operating system retains them in page cache.
  
  If you're wanting to pre-heat sstables there's currently no way to do
  that (but it's a useful feature to have). Pragmatically, you can
  script something that e.g. does cat path/to/keyspace/*  /dev/null
  or similar. But that only works if the total database size fits
  reasonably well in page cache.
  
  Pre-heating sstables on a per-cf basis on start-up would be a nice
  feature to have.
  
  
  
 




Re: Tombstones and memtable_operations

2011-04-19 Thread Héctor Izquierdo Seliva

On Wed, 20-04-2011 at 08:16 +1200, aaron morton wrote:
 I think there may be an issue here; we are counting the number of columns in
 the operation. When deleting an entire row we do not have a column count.
 
 Can you let us know what version you are using and how you are doing the
 delete?
 
 Thanks
 Aaron
 

I'm using 0.7.4. I have a file with all the row keys I have to delete
(around 100 million) and I just go through the file and issue deletes
through Pelops.

Should I manually issue flushes with a cron every so often?
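
If a scripted periodic flush does turn out to be necessary, one option is to
invoke the same flush operation nodetool uses over JMX. The sketch below assumes
the StorageService MBean name and a forceTableFlush(keyspace, columnFamilies...)
signature, so verify both in jconsole for your version before relying on it; the
keyspace and column family names are placeholders.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PeriodicFlush {
        public static void main(String[] args) throws Exception {
            // Assumed JMX endpoint of the node to flush
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName storageService =
                        new ObjectName("org.apache.cassandra.db:type=StorageService");
                // Assumed operation name and signature (what nodetool flush calls); check jconsole
                mbs.invoke(storageService, "forceTableFlush",
                        new Object[] { "mykeyspace", new String[] { "mycolumnfamily" } },
                        new String[] { "java.lang.String", "[Ljava.lang.String;" });
            } finally {
                connector.close();
            }
        }
    }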

 On 20 Apr 2011, at 04:21, Héctor Izquierdo Seliva wrote:
 
  Ok, I've read about gc grace seconds, but i'm not sure I understand it
  fully. Untill gc grace seconds have passed, and there is a compaction,
  the tombstones live in memory? I have to delete 100 million rows and my
  insert rate is very low, so I don't have a lot of compactions. What
  should I do in this case? Lower the major compaction threshold and
  memtable_operations to some very low number?
  
  Thanks
  
  On Tue, 19-04-2011 at 17:36 +0200, Héctor Izquierdo Seliva wrote:
  Hi everyone. I've configured in one of my column families
  memtable_operations = 0.02 and started deleting keys. I have already
  deleted 54k, but there hasn't been any flush of the memtable. Memory
  keeps pilling up and eventually nodes start to do stop-the-world GCs. Is
  this the way this is supposed to work or have I done something wrong?
  
  Thanks!
  
  
  
 




Re: Tombstones and memtable_operations

2011-04-19 Thread Héctor Izquierdo Seliva
On Tue, 19-04-2011 at 23:33 +0300, shimi wrote:
 You can use memtable_flush_after_mins instead of the cron
 
 
 Shimi
 

Good point! I'll try that.

Wouldn't it be better to count a delete as a one-column operation so it
contributes to the flush-by-operations threshold?

 2011/4/19 Héctor Izquierdo Seliva izquie...@strands.com
 
  On Wed, 20-04-2011 at 08:16 +1200, aaron morton wrote:
  I think their may be an issue here, we are counting the
 number of columns in the operation. When deleting an entire
 row we do not have a column count.
 
  Can you let us know what version you are using and how you
 are doing the delete ?
 
  Thanks
  Aaron
 
 
 
 I'm using 0.7.4. I have a file with all the row keys I have to
 delete
 (around 100 million) and I just go through the file and issue
 deletes
 through pelops.
 
 Should I manually issue flushes with a cron every x time?
 
 
  On 20 Apr 2011, at 04:21, Héctor Izquierdo Seliva wrote:
 
   Ok, I've read about gc grace seconds, but i'm not sure I
 understand it
   fully. Untill gc grace seconds have passed, and there is a
 compaction,
   the tombstones live in memory? I have to delete 100
 million rows and my
   insert rate is very low, so I don't have a lot of
 compactions. What
   should I do in this case? Lower the major compaction
 threshold and
   memtable_operations to some very low number?
  
   Thanks
  
    On Tue, 19-04-2011 at 17:36 +0200, Héctor Izquierdo
  Seliva wrote:
   Hi everyone. I've configured in one of my column families
   memtable_operations = 0.02 and started deleting keys. I
 have already
   deleted 54k, but there hasn't been any flush of the
 memtable. Memory
   keeps pilling up and eventually nodes start to do
 stop-the-world GCs. Is
   this the way this is supposed to work or have I done
 something wrong?
  
   Thanks!
  
  
  
 
 
 
 
 
 




Re: Tombstones and memtable_operations

2011-04-19 Thread Héctor Izquierdo Seliva
I posted it a couple of messages back, but here it is again:

I'm using 0.7.4. I have a file with all the row keys I have to delete
(around 100 million) and I just go through the file and issue deletes
 through pelops. Should I manually issue flushes with a cron every x
time?



Re: RE: batch_mutate failed: out of sequence response

2011-04-18 Thread Héctor Izquierdo Seliva
Thanks Dan for fixing that! Is the change integrated in the latest maven
snapshot?

On Tue, 19-04-2011 at 10:48 +1000, Dan Washusen wrote:
 An example scenario (that is now fixed in Pelops):
  1. Attempt to write a column with a null value
  2. Cassandra throws a TProtocolException which renders the
 connection useless for future operations
  3. Pelops returns the corrupt connection to the pool
  4. A second read operation is attempted with the corrupt
 connection and Cassandra throws an ApplicationException
 A Pelops test case for this can be found here:
 https://github.com/s7/scale7-pelops/blob/3fe7584a24bb4b62b01897a814ef62415bd2fe43/src/test/java/org/scale7/cassandra/pelops/MutatorIntegrationTest.java#L262
 
 
 Cheers,
 -- 
 Dan Washusen
 
 
 On Tuesday, 19 April 2011 at 10:28 AM, Jonathan Ellis wrote:
 
  Any idea what's causing the original TPE?
  
  On Mon, Apr 18, 2011 at 6:22 PM, Dan Washusen d...@reactive.org
  wrote:
   It turns out that once a TProtocolException is thrown from
   Cassandra the
   connection is useless for future operations. Pelops was closing
   connections
   when it detected TimedOutException, TTransportException and
   UnavailableException but not TProtocolException.  We have now
   changed Pelops
    to close connections in all cases *except* NotFoundException.
   
   Cheers,
   --
   Dan Washusen
   
   On Friday, 8 April 2011 at 7:28 AM, Dan Washusen wrote:
   
   Pelops uses a single connection per operation from a pool that is
   backed by
   Apache Commons Pool (assuming you're using Cassandra 0.7).  I'm
   not saying
   it's perfect but it's NOT sharing a connection over multiple
   threads.
   Dan Hendry mentioned that he sees these errors.  Is he also using
   Pelops?
From his comment about retrying I'd assume not...
   
   --
   Dan Washusen
   
   On Thursday, 7 April 2011 at 7:39 PM, Héctor Izquierdo Seliva
   wrote:
   
    On Wed, 06-04-2011 at 21:04 -0500, Jonathan Ellis wrote:
   
   out of sequence response is thrift's way of saying I got a
   response
   for request Y when I expected request X.
   
   my money is on using a single connection from multiple threads.
   don't do
   that.
   
   I'm not using thrift directly, and my application is single
   thread, so I
   guess this is Pelops fault somehow. Since I managed to tame memory
   comsuption the problem has not appeared again, but it always
   happened
   during a stop-the-world GC. Could it be that the message was sent
   instead of being dropped by the server when the client assumed it
   had
   timed out?
   
  
  
  
  -- 
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra
  support
  http://www.datastax.com
  
 
 




How to warm up a cold node

2011-04-15 Thread Héctor Izquierdo Seliva
Hi everyone, is there any recommended procedure to warm up a node before
bringing it up? 

Thanks!



Re: How to warm up a cold node

2011-04-15 Thread Héctor Izquierdo Seliva
How difficult do you think this could be? I would be interested in
developing this if it's feasible.
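
A first cut could probably be as dumb as reading the data files sequentially
at startup so the page cache is warm before the node takes traffic, something
like this sketch (path and buffer size are illustrative, and it only helps if
the data fits reasonably well in page cache):

import java.io.*;

public class SSTableWarmer {
    public static void main(String[] args) throws IOException {
        File dir = new File(args.length > 0 ? args[0]
                : "/var/lib/cassandra/data/MyKeyspace");
        File[] files = dir.listFiles();
        if (files == null) return;
        byte[] buf = new byte[1 << 20];
        for (File f : files) {
            if (!f.isFile()) continue;
            InputStream in = new BufferedInputStream(new FileInputStream(f));
            try {
                // Discard the bytes; the read itself pulls them into page cache.
                while (in.read(buf) != -1) { }
            } finally {
                in.close();
            }
        }
    }
}

The interesting (and harder) part would be remembering which parts of which
sstables were actually hot, instead of rereading everything.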

On Fri, 15-04-2011 at 16:19 +0200, Peter Schuller wrote:
  Hi everyone, is there any recommended procedure to warm up a node before
  bringing it up?
 
 Currently the only out-of-the-box support for warming up caches is
 that implied by the key cache and row cache, which will pre-heat on
 start-up. Indexes will be indirectly preheated by index sampling, to
  the extent that the operating system retains them in page cache.
 
 If you're wanting to pre-heat sstables there's currently no way to do
 that (but it's a useful feature to have). Pragmatically, you can
  script something that e.g. does cat path/to/keyspace/* > /dev/null
 or similar. But that only works if the total database size fits
 reasonably well in page cache.
 
 Pre-heating sstables on a per-cf basis on start-up would be a nice
 feature to have.
 




Re: Strange readRepairChance in server logs

2011-04-12 Thread Héctor Izquierdo Seliva
Thanks Aaron!

On Tue, 12-04-2011 at 23:52 +1200, aaron morton wrote:
 Bug in the CLI, created /
 fixed https://issues.apache.org/jira/browse/CASSANDRA-2458
 
 
 use 70 for now. 
 
 
 Thanks
 Aaron
 
 
 On 12 Apr 2011, at 20:46, Héctor Izquierdo Seliva wrote:
 
 
 
  Hi everyone.
  
  I've changed the read repair chance of one of my column families
  from
  cassandra-cli with the following entry:
  
  update column family cf with read_repair_chance = 0.7
  
  I expected to see in the server log 
  
  readRepairChance=0.7
  
  Instead I saw this
  
  readRepairChance=0.006999,
  
  Should I use read_repair_chance = 70 instead of 0.7?
  
  
 
 
 



Cassandra monitoring tool

2011-04-12 Thread Héctor Izquierdo Seliva
Hi everyone.

Looking for ways to monitor Cassandra with Zabbix I could not find
anything that was really usable, till I found mention of a nice class by
smeet. I have based my modification upon his work and now I give it back
to the community.

Here's the project url:

http://code.google.com/p/simple-cassandra-monitoring/

It lets you get statistics for any Keyspace/ColumnFamily you want. To
start it, just build the jar and launch it with your Cassandra
installation's lib folder on the classpath.

The first parameter is the node host name. The second parameter is a
comma separated list of KS:CF values. For example:

java -cp blablabla localhost ks1:cf1,ks1:cf2.

Then point curl to http://localhost:9090/ks1/cf1 and some basic stats
will be displayed.

You can also point to http://localhost:9090/nodeinfo to get some info
about the server.

If you have any suggestion or improvement you would like to see, please
contact me and I will be glad to work on it. Right now it's a bit rough,
but it gets the job done.

Thanks for your time!



Re: Cassandra monitoring tool

2011-04-12 Thread Héctor Izquierdo Seliva
On Tue, 12-04-2011 at 21:24 +0500, Ali Ahsan wrote:
 Thanks for sharing this info,I am getting following error,Can please be 
 more specific how can i run this
 
 
 java -cp 
 /home/ali/apache-cassandra-0.6.3/lib/simple-cassandra-monitoring-1.0.jar 
 127.0.0.1 ks1:cf1,ks1:cf2
 Exception in thread main java.lang.NoClassDefFoundError: 127/0/0/1
 Caused by: java.lang.ClassNotFoundException: 127.0.0.1
  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
 Could not find the main class: 127.0.0.1. Program will exit.
 
 
  
  OR
 
 java -jar 
 /home/ali/apache-cassandra-0.6.3/lib/simple-cassandra-monitoring-1.0.jar  
 localhost 
 ks1:cf1,ks1:cf2
 
 Failed to load Main-Class manifest attribute from
 /home/ali/apache-cassandra-0.6.3/lib/simple-cassandra-monitoring-1.0.jar
 
 

Hi Ali. You should run it like this

java -cp /home/ali/apache-cassandra-0.6.3/lib/*
com.google.code.scm.CassandraMonitoring localhost ks1:cf1,ks2:cf2,etc

I forgot to mention it has been coded against 0.7.x, and I'm not sure it
will work on 0.6.x. I'll try to add support for both 0.6.x and the new
0.8.x version as soon as possible.

 
 On 04/12/2011 07:26 PM, Héctor Izquierdo Seliva wrote:
  Hi everyone.
 
  Looking for ways to monitor cassandra with zabbix I could not found
  anything that was really usable, till I found mention of a nice class by
  smeet. I have based my modification upon his work and now I give it back
  to the community.
 
  Here's the project url:
 
  http://code.google.com/p/simple-cassandra-monitoring/
 
  It allows to get statistics for any Keyspace/ColumnFamily you want. To
  start it just build the jar, and launch it using as classpath your
  cassandra installation lib folder.
 
  The first parameter is the node host name. The second parameter is a
  comma separated list of KS:CF values. For example:
 
  java -cp blablabla localhost ks1:cf1,ks1:cf2.
 
  Then point curl to http://localhost:9090/ks1/cf1 and some basic stats
  will be displayed.
 
  You can also point to http://localhost:9090/nodeinfo to get some info
  about the server.
 
  If you have any suggestion or improvement you would like to see, please
  contact me and I will be glad to work on it. Right now it's a bit rough,
  but it gets the job done.
 
  Thanks for your time!
 
 
 
 
 




Re: Cassandra monitoring tool

2011-04-12 Thread Héctor Izquierdo Seliva
I'm not sure. Are you running it on the same host as the Cassandra node?

On Tue, 12-04-2011 at 22:54 +0500, Ali Ahsan wrote:
 On 04/12/2011 10:42 PM, Héctor Izquierdo Seliva wrote:
 
  I forgot to mention it has been coded against 0.7.x, and I'm not sure it
  will work on 0.6.x. I'll try to add support for both 0.6.x and the new
  0.8.x version as soon as possible.
 
 
 
 I think these error  is because of 0.6.3 ?
 
 
 
 Exception in thread main java.io.IOException: Failed to retrieve 
 RMIServer stub: javax.naming.CommunicationException [Root exception is 
 java.rmi.ConnectIOException: error during JRMP connection establishment; 
 nested exception is:
  java.io.EOFException]
  at 
 javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:342)
  at 
 javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:267)
  at 
 com.google.code.scm.CassandraMonitoring.start(CassandraMonitoring.java:58)
  at 
 com.google.code.scm.CassandraMonitoring.main(CassandraMonitoring.java:190)
 Caused by: javax.naming.CommunicationException [Root exception is 
 java.rmi.ConnectIOException: error during JRMP connection establishment; 
 nested exception is:
  java.io.EOFException]
  at 
 com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:118)
  at 
 com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:203)
  at javax.naming.InitialContext.lookup(InitialContext.java:409)
  at 
 javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1902)
  at 
 javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1871)
  at 
 javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:276)
  ... 3 more
 Caused by: java.rmi.ConnectIOException: error during JRMP connection 
 establishment; nested exception is:
  java.io.EOFException
  at 
 sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:304)
  at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:202)
  at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:340)
  at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
  at 
 com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:114)
  ... 8 more
 Caused by: java.io.EOFException
  at java.io.DataInputStream.readByte(DataInputStream.java:267)
  at 
 sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:246)
  ... 12 more
 
 




Re: RE: batch_mutate failed: out of sequence response

2011-04-07 Thread Héctor Izquierdo Seliva


On Wed, 06-04-2011 at 21:04 -0500, Jonathan Ellis wrote:

 out of sequence response is thrift's way of saying I got a response
 for request Y when I expected request X.
 
 my money is on using a single connection from multiple threads.  don't do 
 that.


I'm not using thrift directly, and my application is single-threaded, so I
guess this is Pelops' fault somehow. Since I managed to tame memory
consumption the problem has not appeared again, but it always happened
during a stop-the-world GC. Could it be that the message was sent instead of
being dropped by the server?


Re: RE: batch_mutate failed: out of sequence response

2011-04-07 Thread Héctor Izquierdo Seliva
On Wed, 06-04-2011 at 21:04 -0500, Jonathan Ellis wrote:
 out of sequence response is thrift's way of saying I got a response
 for request Y when I expected request X.
 
 my money is on using a single connection from multiple threads.  don't do 
 that.
 

I'm not using thrift directly, and my application is single-threaded, so I
guess this is Pelops' fault somehow. Since I managed to tame memory
consumption the problem has not appeared again, but it always happened
during a stop-the-world GC. Could it be that the message was sent
instead of being dropped by the server when the client assumed it had
timed out?





Re: Disable Swap? batch_mutate failed: out of sequence response

2011-04-06 Thread Héctor Izquierdo Seliva
I took a look at vmstats, and there was no swap. Also, our monitoring
tools showed no swap being used at all. It's running with mlockall and
all that. 8GB heap on a 16GB machine

On Tue, 05-04-2011 at 21:24 +0200, Peter Schuller wrote:
  Would you recommend to disable system swap as a rule?   I'm running on 
  Debian 64bit and am seeing light swapping:
 
 I'm not Jonathan, but *yes*. I would go so far as to say that
 disabling swap is a good rule of thumb for *most* production systems
 that serve latency sensitive traffic. For a machine dedicated to
 Cassandra, you definitely want to disable swap.
 
  There's just nothing to be gained really, but lots to lose. There is
 nothing that you *want* swapped out. All the memory you use needs to
 be in memory. You don't want the heap swapped out, you don't want the
 off-heap jvm malloc stuff swapped out, you don't want stacks swapped
 out,etc. As soon as you start swapping you very quickly run into poor
 and unreliable performance. Particularly during GC.
 




Re: RE: batch_mutate failed: out of sequence response

2011-04-06 Thread Héctor Izquierdo Seliva
On Wed, 06-04-2011 at 09:06 +1000, Dan Washusen wrote:
 Pelops raises a RuntimeException?  Can you provide more info please?
 

org.scale7.cassandra.pelops.exceptions.ApplicationException:
batch_mutate failed: out of sequence response

 -- 
 Dan Washusen
 Make big files fly
 visit digitalpigeon.com
 
 On Tuesday, 5 April 2011 at 11:43 PM, Héctor Izquierdo Seliva wrote:
 
   On Tue, 05-04-2011 at 09:35 -0400, Dan Hendry wrote:
   I too have seen the out of sequence response problem. My solution
   has just been to retry and it seems to work. None of my mutations
    are THAT large (< 200 columns). 
   
   The only related information I could find points to a
   thrift/ubuntu bug of some kind
   (http://markmail.org/message/xc3tskhhvsf5awz7). What OS are you
   running? 
   
   Dan
   
  
  Hi Dan. I'm running on Debian stable and cassandra 0.7.4. I have
  rows
  with up to 1000 columns. I have changed the way I was doing the
  batch
  mutates to never be bigger than 100 columns at a time. I hope this
  will
  work, otherwise the move is going to take too long.
  
  The problem is aggravated by Pelops not retrying automatically and
  instead raising a RuntimeException.
  
  I'll try to add a retry if this doesn't work. 
  
  Thanks for your response!
  
  Héctor
  
   -Original Message-
   From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com] 
   Sent: April-05-11 8:30
   To: user@cassandra.apache.org
   Subject: batch_mutate failed: out of sequence response
   
   Hi everyone. I'm having trouble while inserting big amounts of
   data into
   cassandra. I'm getting this exception:
   
   batch_mutate failed: out of sequence response
   
   I'm gessing is due to very big mutates. I have made the batch
   mutates
   smaller and it seems to be behaving. Can somebody shed some light?
   
   Thanks!
   
   No virus found in this incoming message.
   Checked by AVG - www.avg.com 
   Version: 9.0.894 / Virus Database: 271.1.1/3551 - Release Date:
   04/05/11 02:34:00
   
 
 




Re: Disable Swap? batch_mutate failed: out of sequence response

2011-04-06 Thread Héctor Izquierdo Seliva
On Wed, 06-04-2011 at 09:18 +0200, Héctor Izquierdo Seliva wrote:
 I took a look at vmstats, and there was no swap. Also, our monitoring
 tools showed no swap being used at all. It's running with mlockall and
 all that. 8GB heap on a 16GB machine
 

I tried disabling swap completely, and the problem is still appearing.

 On Tue, 05-04-2011 at 21:24 +0200, Peter Schuller wrote:
   Would you recommend to disable system swap as a rule?   I'm running on 
   Debian 64bit and am seeing light swapping:
  
  I'm not Jonathan, but *yes*. I would go so far as to say that
  disabling swap is a good rule of thumb for *most* production systems
  that serve latency sensitive traffic. For a machine dedicated to
  Cassandra, you definitely want to disable swap.
  
   There's just nothing to be gained really, but lots to lose. There is
  nothing that you *want* swapped out. All the memory you use needs to
  be in memory. You don't want the heap swapped out, you don't want the
  off-heap jvm malloc stuff swapped out, you don't want stacks swapped
  out,etc. As soon as you start swapping you very quickly run into poor
  and unreliable performance. Particularly during GC.
  
 
 




batch_mutate failed: out of sequence response

2011-04-05 Thread Héctor Izquierdo Seliva
Hi everyone. I'm having trouble while inserting large amounts of data into
cassandra. I'm getting this exception:

batch_mutate failed: out of sequence response

I'm guessing it is due to very big mutates. I have made the batch mutates
smaller and it seems to be behaving. Can somebody shed some light?

Thanks!



RE: batch_mutate failed: out of sequence response

2011-04-05 Thread Héctor Izquierdo Seliva
On Tue, 05-04-2011 at 09:35 -0400, Dan Hendry wrote:
 I too have seen the out of sequence response problem. My solution has just 
 been to retry and it seems to work. None of my mutations are THAT large (< 
 200 columns). 
 
 The only related information I could find points to a thrift/ubuntu bug of 
 some kind (http://markmail.org/message/xc3tskhhvsf5awz7). What OS are you 
 running? 
 
 Dan
 

Hi Dan. I'm running on Debian stable and cassandra 0.7.4. I have rows
with up to 1000 columns. I have changed the way I was doing the batch
mutates to never be bigger than 100 columns at a time. I hope this will
work, otherwise the move is going to take too long.

The problem is aggravated by Pelops not retrying automatically and
instead raising a RuntimeException.

I'll try to add a retry if this doesn't work. 
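
Probably something along these lines: chunk the columns so a single
batch_mutate never carries more than ~100 of them, and retry a failed chunk a
couple of times (only a sketch; it assumes the scale7 Pelops Mutator calls
look like this, and the pool name and column family are illustrative):

// assumes org.apache.cassandra.thrift.Column / ConsistencyLevel and the
// scale7 Pelops / Mutator classes are on the classpath
void writeChunked(String rowKey, List<Column> columns) throws Exception {
    for (int i = 0; i < columns.size(); i += 100) {
        List<Column> chunk = columns.subList(i, Math.min(i + 100, columns.size()));
        for (int attempt = 0; ; attempt++) {
            try {
                // New Mutator per attempt; Pelops mutators are cheap to create.
                Mutator mutator = Pelops.createMutator("pool");
                mutator.writeColumns("MyColumnFamily", rowKey, chunk);
                mutator.execute(ConsistencyLevel.QUORUM);
                break;
            } catch (Exception e) {
                if (attempt == 2) throw e; // give up after three tries
            }
        }
    }
}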

Thanks for your response!

Héctor

 -Original Message-
 From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com] 
 Sent: April-05-11 8:30
 To: user@cassandra.apache.org
 Subject: batch_mutate failed: out of sequence response
 
 Hi everyone. I'm having trouble while inserting big amounts of data into
 cassandra. I'm getting this exception:
 
 batch_mutate failed: out of sequence response
 
 I'm gessing is due to very big mutates. I have made the batch mutates
 smaller and it seems to be behaving. Can somebody shed some light?
 
 Thanks!
 
 No virus found in this incoming message.
 Checked by AVG - www.avg.com 
 Version: 9.0.894 / Virus Database: 271.1.1/3551 - Release Date: 04/05/11 
 02:34:00
 




RE: batch_mutate failed: out of sequence response

2011-04-05 Thread Héctor Izquierdo Seliva
I'm still running into problems. Now I don't write more than 100 columns
at a time, and I'm having lots of Stop-the-world gc pauses.

I'm writing into three column families, with memtable_operations = 0.3
and memtable_throughput = 64. 

Is any of this wrong?

  -Original Message-
  From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com] 
  Sent: April-05-11 8:30
  To: user@cassandra.apache.org
  Subject: batch_mutate failed: out of sequence response
  
  Hi everyone. I'm having trouble while inserting big amounts of data into
  cassandra. I'm getting this exception:
  
  batch_mutate failed: out of sequence response
  
  I'm gessing is due to very big mutates. I have made the batch mutates
  smaller and it seems to be behaving. Can somebody shed some light?
  
  Thanks!
  
  No virus found in this incoming message.
  Checked by AVG - www.avg.com 
  Version: 9.0.894 / Virus Database: 271.1.1/3551 - Release Date: 04/05/11 
  02:34:00
  
 
 




RE: batch_mutate failed: out of sequence response

2011-04-05 Thread Héctor Izquierdo Seliva
Update with more info:

I'm still running into problems. Now I don't write more than 100 columns
at a time, and I'm having lots of Stop-the-world gc pauses.

I'm writing into three column families, with memtable_operations = 0.3 
and memtable_throughput = 64. There is now swapping, and full GCs are taking 
around 5 seconds. I'm running cassandra with a heap of 8 GB. Should I tune this 
somehow?

Is any of this wrong?
 
   -Original Message-
   From: Héctor Izquierdo Seliva [mailto:izquie...@strands.com] 
   Sent: April-05-11 8:30
   To: user@cassandra.apache.org
   Subject: batch_mutate failed: out of sequence response
   
   Hi everyone. I'm having trouble while inserting big amounts of data into
   cassandra. I'm getting this exception:
   
   batch_mutate failed: out of sequence response
   
   I'm gessing is due to very big mutates. I have made the batch mutates
   smaller and it seems to be behaving. Can somebody shed some light?
   
   Thanks!
   
   No virus found in this incoming message.
   Checked by AVG - www.avg.com 
   Version: 9.0.894 / Virus Database: 271.1.1/3551 - Release Date: 04/05/11 
   02:34:00
   
  
  
 
 




Re: How to use NetworkTopologyStrategy

2011-02-21 Thread Héctor Izquierdo Seliva
Thanks! I totally overlooked that.
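
For the archives, the cli syntax ends up looking something like this (a
sketch; keyspace name, data center names and replica counts are placeholders
and have to match what your snitch reports):

create keyspace MyKeyspace
    with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
    and strategy_options = [{DC1:2, DC2:1}];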

On Mon, 21-02-2011 at 08:14 +1300, Aaron Morton wrote:

 The best examples I know of are in the internal cli help, and 
 conf/casandra.yaml
 Aaron
 
 On 19/02/2011, at 12:51 AM, Héctor Izquierdo Seliva izquie...@strands.com 
 wrote:
 
  Hi!
  
  Can some body give me some hints about how to configure a keyspace with
  NetworkTopologyStrategy via cassandra-cli? Or what is the preferred
  method to do so?
  
  Thanks!
  




Replicate changes from DC1 to DC2, but not from DC2 to DC1

2011-02-21 Thread Héctor Izquierdo Seliva

Hi all.

Is there a way (besides changing the code) to replicate data from data
center 1 to data center 2, but not the other way around? I need to
have a preproduction environment with production data, and ideally with
only a fraction of the data (for example, by key prefixes). I have
poked around StorageProxy and I can make writes in DC2 not replicate to
DC1, and as long as I use DC_QUORUM it stays that way, but it
looks...dangerous. I could do a full key scan but it would take too
long.

Have anybody done something similar?

Thanks!




millions of columns in a row vs millions of rows with one column

2011-02-21 Thread Héctor Izquierdo Seliva
Hi Everyone.

I'm testing performance differences of millions of columns in a row vs
millions of rows. So far it seems wide rows perform better in terms of
reads, but there can be potentially hundreds of millions of columns in a
row. Is this going to be a problem? Should I go with individual rows? I
run 6 nodes with 0.7.2 and RF=3.

Thanks for your help!



Re: Replicate changes from DC1 to DC2, but not from DC2 to DC1

2011-02-21 Thread Héctor Izquierdo Seliva
On Tue, 22-02-2011 at 08:46 +1300, Aaron Morton wrote:
 Take a look at the NetworkTopologyStrategy and/or the RackInferringSnitch 
 together they  decide where to place replicas. It's probably not a great idea 
 to muck around with this stuff though.
 
 How about a hadoop job to pull out the data you want? It would be a full scan 
 but in parallel.

I looked at that, but correct me if I'm wrong: schema changes are
distributed to all nodes, and they all have to agree on a version, so I
can't have a keyspace A in DC1 with NetworkTopologyStrategy with options
= [{DC1:1,DC2:1}] and the same keyspace in DC2 with options [{DC2:1,
DC1:0}]. Is that correct?

 Aaron
 
 On 22/02/2011, at 3:10 AM, Héctor Izquierdo Seliva izquie...@strands.com 
 wrote:
 
  
  Hi all.
  
  Is there a way (besides changing the code) to replicate data from a Data
  center 1 to a Data center 2, but not the other way around? I need to
  have a preproduction environment with production data, and ideally with
  only a fraction of the data (for example, by key preffixes). I have
  poked around StorageProxy and I can make writes in DC2 not replicate to
  DC1, and as long as I use DC_QUORUM it stays that way, but it
  looks...dangerous. I could do a full key scan but it would take too
  long.
  
  Have anybody done something similar?
  
  Thanks!
  
  




Re: millions of columns in a row vs millions of rows with one column

2011-02-21 Thread Héctor Izquierdo Seliva
On Tue, 22-02-2011 at 08:49 +1300, Aaron Morton wrote:
 My preference is to go with more rows as it distributes load better. But the 
 best design is the one that supports your read patterns.
 
 See http://wiki.apache.org/cassandra/LargeDataSetConsiderations for 
 background.
 
 Aaron
 

Those rows are distributed among the three replicas, so my thought was
that I could get away with it and have a more or less balanced cluster.
Anyway, the columns I read are not contiguous, so the effect on I/O
is the same as having individual rows, right? Cassandra still has to seek
to the position of the columns within the row. 

How much space does the key cache use per row? This would make the
number of rows increase by a big factor.

 On 22/02/2011, at 3:56 AM, Héctor Izquierdo Seliva izquie...@strands.com 
 wrote:
 
  Hi Everyone.
  
  I'm testing performance differences of millions of columns in a row vs
  millions of rows. So far it seems wide rows perform better in terms of
  reads, but there can be potentially hundreds of millions of columns in a
  row. Is this going to be a problem? Should I go with individual rows? I
  run 6 nodes with 7.2 and a RF=3.
  
  Thanks for your help!
  




How to use NetworkTopologyStrategy

2011-02-18 Thread Héctor Izquierdo Seliva
Hi!

Can somebody give me some hints about how to configure a keyspace with
NetworkTopologyStrategy via cassandra-cli? Or what is the preferred
method to do so?

Thanks!



Question about fat rows

2011-01-13 Thread Héctor Izquierdo Seliva
Hi everyone.

I have a question about data modeling in my application. I have to store
a customer's items, and I can do it in one fat row per customer where
the column name is the item id and the value a JSON-serialized object, or one
row per item with the same layout. This data is updated almost every
day, sometimes several times per day.

My question is, which scheme will give me better read performance? I
was hoping to save on keys so I could cache all the keys in this CF, but
I'm worried about read performance with frequently updated fat rows.

Any help or hints would be appreciated.

Thanks!



Re: Cassandra 0.7.0rc1 issue with command-cli

2010-11-26 Thread Héctor Izquierdo Seliva
Try ending the lines with ;

Regards

On Fri, 26-11-2010 at 21:25 +1100, jasonmp...@gmail.com wrote:
 Hi,
 
 So I had this working perfectly with beta 3 and now it fails.
 Basically what I do is as follows:
 
 1) Extract new rc1 tarball.
 2) Prepare location based on instructions in Readme.txt:
 
 sudo rm -r /var/log/cassandra
 sudo rm -r /var/lib/cassandra
 sudo mkdir -p /var/log/cassandra
 sudo chown -R `whoami` /var/log/cassandra
 sudo mkdir -p /var/lib/cassandra
 sudo chown -R `whoami` /var/lib/cassandra
 
 3) Then run cassandra
 [devel...@localhost apache-cassandra-0.7.0-rc1]$ bin/cassandra -f
 
  INFO 21:23:41,750 Heap size: 1060569088/1061617664
  INFO 21:23:41,755 JNA not found. Native methods will be disabled.
  INFO 21:23:41,767 Loading settings from
 file:/opt/apache-cassandra-0.7.0-rc1/conf/cassandra.yaml
  INFO 21:23:41,942 DiskAccessMode 'auto' determined to be standard,
 indexAccessMode is standard
  INFO 21:23:42,055 Creating new commitlog segment
 /var/lib/cassandra/commitlog/CommitLog-1290767022055.log
  INFO 21:23:42,129 read 0 from saved key cache
  INFO 21:23:42,132 read 0 from saved key cache
  INFO 21:23:42,138 read 0 from saved key cache
  INFO 21:23:42,142 read 0 from saved key cache
  INFO 21:23:42,143 read 0 from saved key cache
  INFO 21:23:42,147 loading row cache for LocationInfo of system
  INFO 21:23:42,164 completed loading (16 ms; 0 keys)  row cache for
 LocationInfo of system
  INFO 21:23:42,164 loading row cache for HintsColumnFamily of system
  INFO 21:23:42,165 completed loading (1 ms; 0 keys)  row cache for
 HintsColumnFamily of system
  INFO 21:23:42,165 loading row cache for Migrations of system
  INFO 21:23:42,166 completed loading (1 ms; 0 keys)  row cache for
 Migrations of system
  INFO 21:23:42,168 loading row cache for Schema of system
  INFO 21:23:42,168 completed loading (0 ms; 0 keys)  row cache for
 Schema of system
  INFO 21:23:42,168 loading row cache for IndexInfo of system
  INFO 21:23:42,169 completed loading (1 ms; 0 keys)  row cache for
 IndexInfo of system
  INFO 21:23:42,257 Couldn't detect any schema definitions in local storage.
  INFO 21:23:42,258 Found table data in data directories. Consider
 using JMX to call
 org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
  INFO 21:23:42,260 No commitlog files found; skipping replay
  INFO 21:23:42,301 Upgrading to 0.7. Purging hints if there are any.
 Old hints will be snapshotted.
  INFO 21:23:42,306 Cassandra version: 0.7.0-rc1
  INFO 21:23:42,306 Thrift API version: 19.4.0
  INFO 21:23:42,320 Loading persisted ring state
  INFO 21:23:42,338 Starting up server gossip
  INFO 21:23:42,365 switching in a fresh Memtable for LocationInfo at
 CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1290767022055.log',
 position=700)
  INFO 21:23:42,367 Enqueuing flush of
 memtable-locationi...@14361585(227 bytes, 4 operations)
  INFO 21:23:42,367 Writing memtable-locationi...@14361585(227 bytes, 4
 operations)
  INFO 21:23:42,796 Completed flushing
 /var/lib/cassandra/data/system/LocationInfo-e-1-Data.db (473 bytes)
  WARN 21:23:42,861 Generated random token
 124937963426514930245885291999748186719. Random tokens will result in
 an unbalanced ring; see http://wiki.apache.org/cassandra/Operations
  INFO 21:23:42,863 switching in a fresh Memtable for LocationInfo at
 CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1290767022055.log',
 position=996)
  INFO 21:23:42,863 Enqueuing flush of
 memtable-locationi...@17243268(53 bytes, 2 operations)
  INFO 21:23:42,864 Writing memtable-locationi...@17243268(53 bytes, 2
 operations)
  INFO 21:23:43,277 Completed flushing
 /var/lib/cassandra/data/system/LocationInfo-e-2-Data.db (301 bytes)
  INFO 21:23:43,282 Will not load MX4J, mx4j-tools.jar is not in the classpath
  INFO 21:23:43,347 Binding thrift service to localhost/127.0.0.1:9160
  INFO 21:23:43,350 Using TFramedTransport with a max frame size of
 15728640 bytes.
  INFO 21:23:43,353 Listening for thrift clients...
 
 4) start the command line client:
 
 [devel...@localhost apache-cassandra-0.7.0-rc1]$ bin/cassandra-cli
 Welcome to cassandra CLI.
 
 Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
 [defa...@unknown] connect localhost/9160
 
 And as soon as I try and use the connect localhost/9160 it stalls
 
 I did this EXACT same procedure with beta3 and no issues.
 
 I am running on centos 5.5 with java 6
 
 Any ideas?




Re: Cassandra 0.7.0rc1 issue with command-cli

2010-11-26 Thread Héctor Izquierdo Seliva
Yes, I know what you mean. I bashed my head a few times against the
keyboard :)

On Fri, 26-11-2010 at 21:37 +1100, Jason Pell wrote:
 It works perfectly, thanks so much, I was finding it a little frustrating.
 
 On Fri, Nov 26, 2010 at 9:36 PM, Jason Pell ja...@pellcorp.com wrote:
  no way - well I certainly feel stupid!  Is this new, it worked without
  it on beta 3?
 
  2010/11/26 Héctor Izquierdo Seliva izquie...@strands.com:
  Try ending the lines with ;
 
  Regards
 
   On Fri, 26-11-2010 at 21:25 +1100, jasonmp...@gmail.com wrote:
  Hi,
 
  So I had this working perfectly with beta 3 and now it fails.
  Basically what I do is follows:
 
  1) Extract new rc1 tarball.
  2) Prepare location based on instructions in Readme.txt:
 
  sudo rm -r /var/log/cassandra
  sudo rm -r /var/lib/cassandra
  sudo mkdir -p /var/log/cassandra
  sudo chown -R `whoami` /var/log/cassandra
  sudo mkdir -p /var/lib/cassandra
  sudo chown -R `whoami` /var/lib/cassandra
 
  3) Then run cassandra
  [devel...@localhost apache-cassandra-0.7.0-rc1]$ bin/cassandra -f
 
   INFO 21:23:41,750 Heap size: 1060569088/1061617664
   INFO 21:23:41,755 JNA not found. Native methods will be disabled.
   INFO 21:23:41,767 Loading settings from
  file:/opt/apache-cassandra-0.7.0-rc1/conf/cassandra.yaml
   INFO 21:23:41,942 DiskAccessMode 'auto' determined to be standard,
  indexAccessMode is standard
   INFO 21:23:42,055 Creating new commitlog segment
  /var/lib/cassandra/commitlog/CommitLog-1290767022055.log
   INFO 21:23:42,129 read 0 from saved key cache
   INFO 21:23:42,132 read 0 from saved key cache
   INFO 21:23:42,138 read 0 from saved key cache
   INFO 21:23:42,142 read 0 from saved key cache
   INFO 21:23:42,143 read 0 from saved key cache
   INFO 21:23:42,147 loading row cache for LocationInfo of system
   INFO 21:23:42,164 completed loading (16 ms; 0 keys)  row cache for
  LocationInfo of system
   INFO 21:23:42,164 loading row cache for HintsColumnFamily of system
   INFO 21:23:42,165 completed loading (1 ms; 0 keys)  row cache for
  HintsColumnFamily of system
   INFO 21:23:42,165 loading row cache for Migrations of system
   INFO 21:23:42,166 completed loading (1 ms; 0 keys)  row cache for
  Migrations of system
   INFO 21:23:42,168 loading row cache for Schema of system
   INFO 21:23:42,168 completed loading (0 ms; 0 keys)  row cache for
  Schema of system
   INFO 21:23:42,168 loading row cache for IndexInfo of system
   INFO 21:23:42,169 completed loading (1 ms; 0 keys)  row cache for
  IndexInfo of system
   INFO 21:23:42,257 Couldn't detect any schema definitions in local 
  storage.
   INFO 21:23:42,258 Found table data in data directories. Consider
  using JMX to call
  org.apache.cassandra.service.StorageService.loadSchemaFromYaml().
   INFO 21:23:42,260 No commitlog files found; skipping replay
   INFO 21:23:42,301 Upgrading to 0.7. Purging hints if there are any.
  Old hints will be snapshotted.
   INFO 21:23:42,306 Cassandra version: 0.7.0-rc1
   INFO 21:23:42,306 Thrift API version: 19.4.0
   INFO 21:23:42,320 Loading persisted ring state
   INFO 21:23:42,338 Starting up server gossip
   INFO 21:23:42,365 switching in a fresh Memtable for LocationInfo at
  CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1290767022055.log',
  position=700)
   INFO 21:23:42,367 Enqueuing flush of
  memtable-locationi...@14361585(227 bytes, 4 operations)
   INFO 21:23:42,367 Writing memtable-locationi...@14361585(227 bytes, 4
  operations)
   INFO 21:23:42,796 Completed flushing
  /var/lib/cassandra/data/system/LocationInfo-e-1-Data.db (473 bytes)
   WARN 21:23:42,861 Generated random token
  124937963426514930245885291999748186719. Random tokens will result in
  an unbalanced ring; see http://wiki.apache.org/cassandra/Operations
   INFO 21:23:42,863 switching in a fresh Memtable for LocationInfo at
  CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1290767022055.log',
  position=996)
   INFO 21:23:42,863 Enqueuing flush of
  memtable-locationi...@17243268(53 bytes, 2 operations)
   INFO 21:23:42,864 Writing memtable-locationi...@17243268(53 bytes, 2
  operations)
   INFO 21:23:43,277 Completed flushing
  /var/lib/cassandra/data/system/LocationInfo-e-2-Data.db (301 bytes)
   INFO 21:23:43,282 Will not load MX4J, mx4j-tools.jar is not in the 
  classpath
   INFO 21:23:43,347 Binding thrift service to localhost/127.0.0.1:9160
   INFO 21:23:43,350 Using TFramedTransport with a max frame size of
  15728640 bytes.
   INFO 21:23:43,353 Listening for thrift clients...
 
  4) start the command line client:
 
  [devel...@localhost apache-cassandra-0.7.0-rc1]$ bin/cassandra-cli
  Welcome to cassandra CLI.
 
  Type 'help' or '?' for help. Type 'quit' or 'exit' to quit.
  [defa...@unknown] connect localhost/9160
 
  And as soon as I try and use the connect localhost/9160 it stalls
 
  I did this EXACT same procedure with beta3 and no issues.
 
  I am running on centos 5.5 with java 6
 
  Any ideas?
 
 
 
 




Re: cassandra-cli no command working - mac osx

2010-11-25 Thread Héctor Izquierdo Seliva
That happened to me too. Try with a ; at the end of the line.



On Thu, 25-11-2010 at 17:22 +, Marcin wrote:
 Hi guys,
 
 I am having a weird problem: cassandra is working but I can't get 
 cassandra-cli to work.
 
 When I run a command - any command, even help - and hit enter I am not 
 getting any response. Any ideas?
 
 P.S.
 Running Cassandra 0.7.0 RC1
 
 cheers,
 /Marcin




Wide rows or tons of rows?

2010-10-11 Thread Héctor Izquierdo Seliva
Hi everyone.

I'm sure this question or a similar one has come up before, but I can't find a
clear answer. I have to store an unknown number of items in cassandra,
which can vary from a few hundred to a few million per customer.

I read that in cassandra wide rows are better than a lot of rows, but
then I face two problems. First, column distribution. The only way I can
think of distributing items among a given set of rows is hashing the
item id to a row id, and then using the item id as the column name. In
this way, I can distribute data among a few rows evenly, but if there
are only a few items it's equivalent to a row per item plus more
overhead, and if there are millions of items then the rows are too big,
and I have to turn off the row cache. Does anybody know a way around this? 
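
In code, the hashing scheme I mean is roughly this (a sketch; the bucket
count is arbitrary):

// Spread a customer's items over a fixed number of rows: the row key is
// derived from a hash of the item id, the column name is the item id itself.
static final int BUCKETS = 64; // illustrative

static String rowKeyFor(String customerId, String itemId) {
    int bucket = Math.abs(itemId.hashCode() % BUCKETS);
    return customerId + ":" + bucket;
}
// column name = itemId, column value = the serialized item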

The second issue is that in my benchmarks, once the data is mmapped, one
item per row performs faster than wide rows by a significant margin. Is
this how it is supposed to be?

I can give additional data if needed. English is not my first language,
so I apologize beforehand if some of this doesn't make sense.

Thanks for your time



Re: Wide rows or tons of rows?

2010-10-11 Thread Héctor Izquierdo Seliva
On Mon, 11-10-2010 at 11:08 -0400, Edward Capriolo wrote:

Inlined:

 2010/10/11 Héctor Izquierdo Seliva izquie...@strands.com:
  Hi everyone.
 
  I'm sure this question or similar has come up before, but I can't find a
  clear answer. I have to store a unknown number of items in cassandra,
  which can vary from a few hundreds to a few millions per customer.
 
  I read that in cassandra wide rows are better than a lot of rows, but
  then I face two problems. First, column distribution. The only way I can
  think of distributing items among a given set of rows is hashing the
  item id to a row id, and the using the item id as the column name. In
  this way, I can distribute data among a few rows evenly, but If there
  are only a few items it's equivalent to a row per item plus more
  overhead, and if there are millions of items then the rows are to big,
  and I have to turn off row cache. Does anybody knows a way around this?
 
  The second issue is that in my benchmarks, once the data is mmapped, one
  item per row performs faster than wide rows by a significant margin. Is
  this how it is supposed to be?
 
  I can give additional data if needed. English is not my first language
  so I apologize beforehand is some of this doesn't make sense.
 
  Thanks for your time
 
 
 If you have wide rows RowCache is a problem. IMHO RowCache is only
 viable in situations where you have a fixed amount of data and thus
 will get a high hit rate. I was running a large row cache for some
 time and I found it unpredictable. It causes memory pressure on the
 JVM from moving things in and out of memory, and if the hit rate is
 low taking a key and all its columns in and out repeatedly ends up
  being counterproductive for disk utilization. Suggest KeyCache in
 most situations, (there is a ticket opened for a fractional row cache)

I saw the same behavior. It's a pity there is no column cache. That
would be awesome.

 Another factor to consider is if you have many rows and many columns
  you end up with large(r) indexes. In our case we have start-up times
  slightly longer than we would like because the process of sampling
  indexes during start-up is intensive. If I could do it all over again
  I might serialize more into single columns rather than exploding data
 across multiple rows and columns. If you always need to look up the
 entire row do not break it down by columns.

So it might be better to store a JSON-serialized version then? I was
using SuperColumns to store item info, but a simple string might give me
the option to do some compression.
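
For the compression, something as simple as java.util.zip on the serialized
string would probably do (a sketch):

// assumes java.io.ByteArrayOutputStream, java.io.IOException,
// java.util.zip.GZIPOutputStream
// Gzip a JSON string before storing it as a single column value.
static byte[] compress(String json) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    GZIPOutputStream gz = new GZIPOutputStream(bos);
    gz.write(json.getBytes("UTF-8"));
    gz.close();
    return bos.toByteArray();
}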

 memory mapping. There are different dynamics depending on data size
 relative to memory size. You may have something like ~ 40GB of data
 and 10GB index, 32GB RAM a node, this system is not going to respond
 the same way with say 200GB data 25 GB Indexes. Also it is very
 workload dependent.

We have a 6-node cluster with 16 GB RAM each, although the whole
dataset is expected to be around 100GB per machine. Which indexes are
more expensive, row or column indexes?

 Hope this helps,
 Edward

It does!