Re: Cassandra read optimization
Hi Tyler and Aaron,

Thanks for your replies.

Tyler, fetching scs using your pycassa script on our server takes ~7 s, consistent with the times we've been seeing. Now, we aren't really experts in Cassandra, but it seems that JNA is enabled by default for Cassandra 1.0 according to Jeremy (http://comments.gmane.org/gmane.comp.db.cassandra.user/21441). But in case it isn't, how do you turn it on in 1.0.8? I'm also setting MAX_HEAP_SIZE=2G in cassandra-env.sh; I'm hoping that's how you increase the Java heap size. I've tried 3G as well, without any increase in performance. It did, however, allow for taking larger slices.

Aaron, we are not doing multi-threaded requests for now, but we'll give it a shot in the next day or two and I'll let you know if there is any improvement.

Thanks for your help!
Dan F.

On Wed, Apr 18, 2012 at 9:44 PM, Tyler Hobbs ty...@datastax.com wrote:

I tested this out with a small pycassa script: https://gist.github.com/2418598

On my not-very-impressive laptop, I can read 5000 of the super columns in 3 seconds (cold) or 1.5 (warm). Reading in batches of 1000 super columns at a time gives much better performance; I definitely recommend going with a smaller batch size.

Make sure that the timeout on your ConnectionPool isn't too low to handle a big request in pycassa. If you turn on logging (as it is in the script I linked), you should be able to see if the request is timing out a couple of times before it succeeds. It might also be good to make sure that you've got JNA in place and that your heap size is sufficient.

On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner synfina...@gmail.com wrote:

On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote:

Hi all,

I'm trying to optimize moving data from Cassandra to HDFS using either a Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single-node machine.

My data in Cassandra (1.0.8) consists of 2 rows (for now) with ~150k super columns each (I know, I know, super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column. I should also mention that currently the database is static; there are no writes/updates, only reads.

Anyway, in my Python/Ruby scripts, I'm taking slices 5000 super columns long from a single row. It takes 13 seconds with Ruby and 8 seconds with pycassa to get a single slice. In other words, it's currently reading at less than 500 kB per second. The speed seems to be linear in the length of the slice (i.e. 6 seconds for 2500 scs with Ruby). If I run nodetool cfstats while my script is running, it tells me that the read latency on the column family is ~300 ms. I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance.

Is your client multi-threaded? The single-threaded performance of Cassandra isn't at all impressive; it really is designed for dealing with a lot of simultaneous requests.

--
Aaron Turner
http://synfin.net/  Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
"Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety." -- Benjamin Franklin
"carpe diem quam minimum credula postero"

--
Tyler Hobbs
DataStax
http://datastax.com/
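[Editor's note: Tyler's batching advice (page through the row in small slices rather than one huge request) can be sketched as a generic pagination loop. This is a sketch, not the script from the gist: fetch_slice below is a stand-in for a real pycassa ColumnFamily.get(key, column_start=..., column_count=...) call, and the in-memory "row" and all names are illustrative only.]

```python
def fetch_slice(row, start, batch_size):
    """Stand-in for a pycassa ColumnFamily.get(key, column_start=start,
    column_count=batch_size) call; here it just slices a sorted dict."""
    names = sorted(n for n in row if n >= start)[:batch_size]
    return {n: row[n] for n in names}

def iter_supercolumns(row, batch_size=1000):
    """Yield every super column in a wide row, batch_size at a time.
    Each request restarts just past the last name already seen, which
    is how you page through a wide row with column_start."""
    start = ""
    while True:
        batch = fetch_slice(row, start, batch_size)
        if not batch:
            break
        for name in sorted(batch):
            yield name, batch[name]
        # the next slice begins strictly after the last name in this batch
        start = max(batch) + "\x00"
```

With a real pool you would also raise the ConnectionPool timeout so a single large slice can't time out and silently retry, which is the failure mode Tyler's logging suggestion exposes.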
Re: Cassandra read optimization
Hi Paolo,

Thanks for the hint; JNA indeed wasn't installed. However, now that Cassandra is actually using it, there doesn't seem to be any change in terms of speed: still 7 seconds with pycassa.

On Thu, Apr 19, 2012 at 12:14 AM, Paolo Bernardi berna...@gmail.com wrote:

Look into your Cassandra's logs to see if JNA is really enabled (it really should be, by default) and, more importantly, if JNA is loaded correctly. You might find some surprising message over there: if this is the case, just install JNA with your distro's package manager and, if it still doesn't work, copy the JNA jar into Cassandra's lib directory (been there, done that).

Paolo
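[Editor's note: for reference alongside the JNA and heap tuning discussed above, the heap override Dan mentions lives in conf/cassandra-env.sh. A minimal sketch; the values are examples only, and the file's own comments ask that both variables be set together.]

```shell
# conf/cassandra-env.sh -- override the autodetected JVM heap sizes.
# Set both together; values below are illustrative, not recommendations.
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="400M"
```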
Re: Cassandra read optimization
We'll try doing multithreaded requests today or tomorrow.

As for tuning down the number of super columns per slice, I tried doing that, but I've noticed that the time decreases linearly with the length of the slice. So grabbing 1000 per slice takes 1/5 as long as 5000, but I'll have to make 5 times as many requests to the database; that was the reason for choosing a pretty much arbitrary, relatively large number like 5000.

Finally, forgive my newbiness, but what do you mean by "use Hadoop to export"? Since the whole point of using these clients is to move Cassandra data to HDFS for future analysis, if there is a more direct way of doing that, it would be perfect!

Dan F.

On Thu, Apr 19, 2012 at 4:24 AM, aaron morton aa...@thelastpickle.com wrote:

Here's a test I did a while ago about creating column objects in Python: http://www.mail-archive.com/user@cassandra.apache.org/msg06729.html

As Tyler said, the best approach is to limit the size of the slices. If you are trying to load 125K super columns with 25 columns each, you are asking for roughly 3M columns. That is going to take a while to iterate over using a single client. If possible, break the task up for multiple threads / processes.

Is that a task you need to repeat? Can you use Hadoop to export?

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
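[Editor's note: the "break the task up for multiple threads" suggestion can be sketched as below. This is a sketch in modern Python, not the poster's script: fetch_slice is a stand-in for one client request (e.g. a pycassa get with column_start/column_finish drawing a connection from a shared ConnectionPool), and all names are illustrative.]

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_slice(row, names):
    """Stand-in for one client's slice request against Cassandra;
    here it simply returns the requested super columns."""
    return {n: row[n] for n in names}

def parallel_read(row, batch_size=1000, workers=4):
    """Split a wide row into fixed-size batches and fetch them from
    several threads at once. Cassandra's single-threaded read path is
    slow, but it handles many simultaneous requests well, so issuing
    the batches concurrently beats fetching them one after another."""
    names = sorted(row)
    batches = [names[i:i + batch_size]
               for i in range(0, len(names), batch_size)]
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(lambda b: fetch_slice(row, b), batches):
            merged.update(result)
    return merged
```

In a real run each worker thread should check connections out of one shared pool rather than opening its own, so the client-side concurrency matches what the pool allows.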
Cassandra read optimization
Hi all,

I'm trying to optimize moving data from Cassandra to HDFS using either a Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single-node machine.

My data in Cassandra (1.0.8) consists of 2 rows (for now) with ~150k super columns each (I know, I know, super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column. I should also mention that currently the database is static; there are no writes/updates, only reads.

Anyway, in my Python/Ruby scripts, I'm taking slices 5000 super columns long from a single row. It takes 13 seconds with Ruby and 8 seconds with pycassa to get a single slice. In other words, it's currently reading at less than 500 kB per second. The speed seems to be linear in the length of the slice (i.e. 6 seconds for 2500 scs with Ruby). If I run nodetool cfstats while my script is running, it tells me that the read latency on the column family is ~300 ms. I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance.

Thanks,
Dan F.
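[Editor's note: the throughput figure quoted in the thread is easy to sanity-check from the slice size and timings given in this post.]

```python
# Sanity-checking the numbers in the post.
sc_per_slice = 5000        # super columns per slice
bytes_per_sc = 800         # ~25 columns totaling ~800 bytes each
slice_bytes = sc_per_slice * bytes_per_sc   # 4,000,000 bytes per slice

pycassa_rate = slice_bytes / 8.0   # 8 s per slice  -> 500,000 B/s, i.e. ~500 kB/s
ruby_rate = slice_bytes / 13.0     # 13 s per slice -> ~308,000 B/s

# The full data set: 2 rows x ~150k super columns x ~25 columns each,
# the millions-of-columns iteration Aaron Morton describes.
total_columns = 2 * 150000 * 25    # 7,500,000 columns
```

At ~500 kB/s, even one ~120 MB row takes a few minutes to drain through a single client, which is why the smaller-batch and multi-threading suggestions in the replies matter.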