On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman <hriunde...@gmail.com> wrote:
> Hi all,
>
> I'm trying to optimize moving data from Cassandra to HDFS using either Ruby
> or Python client. Right now, I'm playing around on my staging server, an 8
> GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for
> now) with ~150k super columns each (I know, I know - super columns are bad).
> Every super column has ~25 columns totaling ~800 bytes per super column.
>
> I should also mention that currently the database is static - there are no
> writes/updates, only reads.
>
> Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns
> long from a single row. It takes 13 seconds with ruby and 8 seconds with
> pycassa to get a single slice. Or, in other words, it's currently reading at
> speeds of less than 500 kB per second. The speed seems to be linear with the
> length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool
> cfstats while my script is running, it tells me that my read latency on the
> column family is ~300ms.
>
> I assume that this is not normal and thus was wondering what parameters I
> could tweak to improve the performance.
>

Is your client multi-threaded? Cassandra's single-threaded performance
isn't at all impressive; it really is designed for dealing with a lot of
simultaneous requests.

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
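
To illustrate what issuing simultaneous requests could look like on the
client side, here is a minimal Python sketch using only the standard
library. It is not pycassa code: fetch_slice, fetch_all, and ROW are
hypothetical stand-ins, and the sleep merely simulates a network round
trip. It also assumes the slice boundaries are known up front; real
Cassandra slices are addressed by column name rather than numeric offset,
so an actual client would need known split points (or could simply read
the two rows in parallel).

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-in for one row: 20 super column names in sorted order.
ROW = ["sc%04d" % i for i in range(20)]


def fetch_slice(start, count):
    """Placeholder for a real client call (NOT a pycassa API).

    Returns up to `count` super column names beginning at offset `start`.
    """
    time.sleep(0.01)  # simulate one network round trip
    return ROW[start:start + count]


def fetch_all(total, slice_size, workers):
    """Request every slice concurrently and return the names in row order."""
    starts = range(0, total, slice_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order even though calls overlap.
        chunks = pool.map(lambda s: fetch_slice(s, slice_size), starts)
        return [name for chunk in chunks for name in chunk]


if __name__ == "__main__":
    names = fetch_all(total=len(ROW), slice_size=5, workers=4)
    print(len(names))
```

With a real client, fetch_slice would be swapped for the actual slice
call, and the worker count would be tuned against what the single node
can sustain.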