Hi all, I'm trying to optimize moving data from Cassandra to HDFS using either a Ruby or a Python client. Right now I'm experimenting on my staging server, an 8 GB single-node machine. My data in Cassandra (1.0.8) consists of 2 rows (for now) with ~150k super columns each (I know, I know, super columns are bad). Every super column has ~25 columns, totaling ~800 bytes per super column.
I should also mention that the database is currently static: there are no writes or updates, only reads.

In my Python/Ruby scripts, I'm taking slices of 5000 super columns from a single row. It takes 13 seconds with Ruby and 8 seconds with pycassa to get a single slice. In other words, it's currently reading at 500 kB per second at best. The time seems to scale linearly with the length of the slice (e.g. 6 seconds for 2500 super columns with Ruby). If I run nodetool cfstats while my script is running, it reports a read latency of ~300 ms on the column family.

I assume this is not normal, so I was wondering which parameters I could tweak to improve performance.

Thanks,
Dan F.
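
P.S. For reference, here is the back-of-the-envelope arithmetic behind the throughput figure, as a small Python sketch. It assumes the rough ~800 bytes per super column estimate from above and counts 1 kB as 1000 bytes; the function name is made up for illustration and is not part of pycassa or any other client.

```python
def slice_throughput_kb_per_s(num_super_columns, bytes_per_super_column, seconds):
    """Approximate read throughput in kB/s (1 kB = 1000 bytes) for one slice."""
    total_bytes = num_super_columns * bytes_per_super_column
    return total_bytes / seconds / 1000.0


# Measured slice times from the post: 5000 super columns per slice.
SLICE_SIZE = 5000
BYTES_PER_SC = 800  # rough estimate, not a measured value

for client, seconds in (("ruby", 13), ("pycassa", 8)):
    kb_per_s = slice_throughput_kb_per_s(SLICE_SIZE, BYTES_PER_SC, seconds)
    ms_per_sc = seconds * 1000.0 / SLICE_SIZE
    print(f"{client}: {kb_per_s:.1f} kB/s, {ms_per_sc:.2f} ms per super column")
```

With those numbers, a slice is about 4 MB, which works out to roughly 308 kB/s for Ruby and 500 kB/s for pycassa.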