C* users,
We have a process that loads a large batch of rows from Cassandra into
many separate compute workers. The rows are one column wide and range in
size from a couple of KB to ~100 MB. After manipulating the data for a
while, each compute worker writes the data back under *new* row keys
(UUIDs) computed by the workers.
After the full batch is written back to new rows, a cleanup worker deletes
the old rows.
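In case it helps to see the shape of it, each cycle looks roughly like the
following. This is a heavily simplified, single-worker, synchronous sketch;
the keyspace, column family, and 'data' column name are stand-ins for our
actual schema:

    import uuid
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('OurKeyspace', server_list=['localhost:9160'])
    cf = ColumnFamily(pool, 'BlobStore')

    def run_cycle(old_keys, process):
        # compute worker: read the single-column row, manipulate it,
        # and write the result back under a brand-new UUID row key
        new_keys = []
        for old_key in old_keys:
            blob = cf.get(old_key)['data']
            result = process(blob)
            new_key = str(uuid.uuid4())
            cf.insert(new_key, {'data': result})
            new_keys.append(new_key)
        # cleanup worker: once the whole batch is rewritten,
        # delete the old rows
        for old_key in old_keys:
            cf.remove(old_key)
        return new_keys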
After several cycles, pycassa starts getting connection failures.
Should we use a pycassa listener to catch these failures and just recreate
the ConnectionPool and keep going as if the connection had not dropped?
Or is there a better approach?
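Concretely, what we had in mind is something along these lines. It is only a
minimal sketch, assuming pycassa's PoolListener class, its connection_failed()
hook, and the listeners= argument to ConnectionPool behave the way we read the
docs; keyspace and server list are again stand-ins:

    import logging
    from pycassa.pool import ConnectionPool, PoolListener

    class FailureListener(PoolListener):
        def __init__(self):
            self.failed = False

        def connection_failed(self, dic):
            # dic is whatever event info pycassa passes to the hook;
            # log it and flag the pool for recreation rather than
            # rebuilding the pool from inside the hook itself
            logging.warning("pycassa connection failure: %r", dic)
            self.failed = True

    def make_pool():
        listener = FailureListener()
        pool = ConnectionPool('OurKeyspace',
                              server_list=['localhost:9160'],
                              listeners=[listener])
        return pool, listener

    pool, listener = make_pool()

    def get_pool():
        # workers call this before each batch and transparently pick
        # up a fresh pool if a connection failure was seen
        global pool, listener
        if listener.failed:
            pool.dispose()
            pool, listener = make_pool()
        return pool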
These failures happen on a simple single-node setup with a total data set
less than half the size of the Java heap, e.g. ~2 GB of data (times two for
the two copies that exist during cycling) versus an 8 GB heap. We tried
reducing memtable_flush_queue_size to 2 so that the deletes would be
flushed sooner, and also tried multithreaded_compaction=true, but pycassa
still gets connection failures.
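For reference, the cassandra.yaml settings we changed are just these two
lines:

    memtable_flush_queue_size: 2
    multithreaded_compaction: true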
Is this expected behavior when the node is shedding load, or is it
unexpected?
Would things be any different if we used multiple nodes and scaled the
data and worker count to match? That is, is there something inherent to
Cassandra's operating model that makes it want to always have multiple
nodes?
Thanks for any pointers,
John