Hi all,

I'm trying to optimize moving data from Cassandra to HDFS using either a Ruby
or a Python client. Right now, I'm experimenting on my staging server, an
8 GB single-node machine. My data in Cassandra (1.0.8) consists of 2 rows (for
now) with ~150k super columns each (I know, I know - super columns are
bad). Every super column has ~25 columns, totaling ~800 bytes per super
column.

I should also mention that currently the database is static - there are no
writes/updates, only reads.

Anyway, in my Python/Ruby scripts, I'm reading slices of 5000 super columns
from a single row. A single slice takes 13 seconds with Ruby and 8 seconds
with pycassa. In other words, at ~800 bytes per super column a slice is about
4 MB, so I'm reading at roughly 300 kB/s (Ruby) to 500 kB/s (pycassa). The
time seems to scale linearly with the slice length (e.g. 6 seconds for 2500
super columns with Ruby). If I run nodetool cfstats while my script is
running, it reports a read latency of ~300 ms on the column family.
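
For reference, here is the back-of-the-envelope arithmetic behind those
throughput figures as a small Python sketch. The function names are only
illustrative, and it assumes ~800 bytes per super column (an approximation)
and 1 kB = 1000 bytes.

```python
# Rough throughput check using the timings quoted above.
# Assumption: each super column is ~800 bytes (approximate average).
BYTES_PER_SUPER_COLUMN = 800


def throughput_kb_per_s(super_columns, seconds,
                        bytes_each=BYTES_PER_SUPER_COLUMN):
    """Read throughput in kB/s (1 kB = 1000 bytes)."""
    return super_columns * bytes_each / seconds / 1000.0


def ms_per_super_column(super_columns, seconds):
    """Average wall-clock time spent per super column, in milliseconds."""
    return seconds * 1000.0 / super_columns


# (label, super columns per slice, seconds per slice)
measurements = [
    ("ruby, 5000 scs", 5000, 13.0),
    ("ruby, 2500 scs", 2500, 6.0),
    ("pycassa, 5000 scs", 5000, 8.0),
]

for label, count, seconds in measurements:
    print("%-18s %6.1f kB/s  %.2f ms per super column" % (
        label,
        throughput_kb_per_s(count, seconds),
        ms_per_super_column(count, seconds),
    ))
```

That works out to about 308 and 333 kB/s for the two Ruby runs and 500 kB/s
for pycassa, i.e. somewhere between 1.6 and 2.6 ms per super column.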

I assume this is not normal, so I was wondering which parameters I could
tweak to improve the performance.

Thanks,
Dan F.
