C* users,

We have a process that loads a large batch of rows from Cassandra into many separate compute workers. The rows are one column wide and range in size from a couple of KB to ~100 MB. After manipulating the data for a while, each worker writes its data back under *new* row keys (UUIDs generated by the worker).
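Each worker's write-back step looks roughly like this (just a sketch; 'MyKeyspace', 'MyCF', and the single 'data' column are placeholder names):

import uuid

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

pool = ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
cf = ColumnFamily(pool, 'MyCF')

def write_back(blob):
    # Generate a fresh UUID row key for the transformed data and write it
    # back as a single wide column value.
    new_key = str(uuid.uuid4())
    cf.insert(new_key, {'data': blob})
    return new_key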

After the full batch is written back to new rows, a cleanup worker deletes the old rows.
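The cleanup worker then does roughly this, reusing a pool/ColumnFamily set up the same way as above (again a sketch; the batch size is arbitrary):

def delete_old_rows(cf, old_keys):
    # Once the new rows are confirmed written, remove the superseded rows
    # in batches through a pycassa mutator.
    batch = cf.batch(queue_size=100)
    for old_key in old_keys:
        batch.remove(old_key)
    batch.send()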

After several cycles, pycassa starts getting connection failures.

Should we use a pycassa listener to catch these failures, recreate the ConnectionPool, and keep going as if the connection had never dropped? Or is there a better approach?
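For context, the fallback we have in mind if a listener isn't the right tool is to catch the pool exceptions and rebuild the pool, roughly like this (a sketch only, assuming AllServersUnavailable / MaximumRetryException are the exceptions that surface; the retry count, backoff, and 'MyKeyspace'/'MyCF' names are placeholders):

import time

from pycassa.pool import (ConnectionPool, AllServersUnavailable,
                          MaximumRetryException)
from pycassa.columnfamily import ColumnFamily

def make_pool():
    return ConnectionPool('MyKeyspace', server_list=['localhost:9160'])

pool = make_pool()
cf = ColumnFamily(pool, 'MyCF')

def insert_with_pool_recovery(key, columns, retries=3):
    # On connection failures, throw the pool away, rebuild it, and retry
    # the write a few times with backoff before giving up.
    global pool, cf
    for attempt in range(retries):
        try:
            cf.insert(key, columns)
            return
        except (AllServersUnavailable, MaximumRetryException):
            time.sleep(2 ** attempt)
            try:
                pool.dispose()
            except Exception:
                pass
            pool = make_pool()
            cf = ColumnFamily(pool, 'MyCF')
    raise RuntimeError('insert failed after %d retries' % retries)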

These failures happen on a simple single-node setup with a total data set less than half the size of the Java heap, e.g. 2 GB of data (times two for the two copies that exist during cycling) versus an 8 GB heap. We tried reducing memtable_flush_queue_size to 2 so the deletes flush faster, and also tried multithreaded_compaction=true, but pycassa still gets connection failures.

Is this expected behavior when the node is shedding load, or is it unexpected?

Would things be any different if we used multiple nodes and scaled the data and worker count to match? In other words, is there something inherent to Cassandra's operating model that makes it want to always have multiple nodes?

Thanks for any pointers,
John
