Re: Cassandra read optimization

2012-04-19 Thread Dan Feldman
Hi Tyler and Aaron,

Thanks for your replies.

Tyler,
fetching scs using your pycassa script on our server takes ~7 s - consistent with the times we've been seeing. Now, we aren't really experts in Cassandra, but it seems that JNA is enabled by default for Cassandra 1.0 according to Jeremy (http://comments.gmane.org/gmane.comp.db.cassandra.user/21441). But in case it isn't, how do you turn it on in 1.0.8?

I'm also setting MAX_HEAP_SIZE=2G in cassandra-env.sh. I'm hoping that's how you increase the Java heap size. I've tried 3G as well, without any increase in performance. It did, however, allow for taking larger slices.
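For reference, the lines I'm setting in conf/cassandra-env.sh look like this (the HEAP_NEWSIZE value below is just an illustrative placeholder; as far as I recall, the 1.0.x script wants both variables set together or neither):

    MAX_HEAP_SIZE="2G"
    HEAP_NEWSIZE="200M"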

Aaron,
we are not doing multi-threaded requests for now, but we'll give it a shot in the next day or two and I'll let you know if there is any improvement.

Thanks for your help!
Dan F.


On Wed, Apr 18, 2012 at 9:44 PM, Tyler Hobbs ty...@datastax.com wrote:

 I tested this out with a small pycassa script:
 https://gist.github.com/2418598

 On my not-very-impressive laptop, I can read 5000 of the super columns in 3 seconds (cold) or 1.5 (warm). Reading in batches of 1000 super columns at a time gives much better performance; I definitely recommend going with a smaller batch size.

 Make sure that the timeout on your ConnectionPool isn't too low to handle a big request in pycassa. If you turn on logging (as it is in the script I linked), you should be able to see if the request is timing out a couple of times before it succeeds.

 It might also be good to make sure that you've got JNA in place and your
 heap size is sufficient.
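 Roughly, a minimal pycassa sketch of the batching, timeout, and logging I mean (the keyspace, column family, and row key names are made up; note the pool timeout defaults to only 0.5 s, so raise it explicitly):

     import logging
     from pycassa.pool import ConnectionPool
     from pycassa.columnfamily import ColumnFamily

     logging.basicConfig(level=logging.DEBUG)  # pycassa logs via stdlib logging

     pool = ConnectionPool('MyKeyspace', ['localhost:9160'], timeout=30)
     cf = ColumnFamily(pool, 'MySuperCF')

     # Page through one row 1000 super columns at a time instead of one
     # 5000-wide slice.
     batch = cf.get('row_key', column_count=1000)
     total = 0
     while batch:
         total += len(batch)          # stand-in for real processing of the batch
         last = list(batch.keys())[-1]
         batch = cf.get('row_key', column_start=last, column_count=1001)
         batch.pop(last, None)        # column_start is inclusive; drop the overlap
     print('%d super columns read' % total)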


 On Wed, Apr 18, 2012 at 8:59 PM, Aaron Turner synfina...@gmail.com wrote:

 On Wed, Apr 18, 2012 at 5:00 PM, Dan Feldman hriunde...@gmail.com wrote:
  Hi all,
 
  I'm trying to optimize moving data from Cassandra to HDFS using either Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for now) with ~150k super columns each (I know, I know - super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column.
 
  I should also mention that currently the database is static - there are no writes/updates, only reads.
 
  Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns long from a single row. It takes 13 seconds with ruby and 8 seconds with pycassa to get a single slice. Or, in other words, it's currently reading at speeds of less than 500 kB per second. The speed seems to be linear with the length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool cfstats while my script is running, it tells me that my read latency on the column family is ~300ms.
 
  I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance.

 Is your client multi-threaded? The single-threaded performance of Cassandra isn't at all impressive; it really is designed for dealing with a lot of simultaneous requests.


 --
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
 carpe diem quam minimum credula postero




 --
 Tyler Hobbs
 DataStax http://datastax.com/




Re: Cassandra read optimization

2012-04-19 Thread Dan Feldman
Hi Paolo,

Thanks for the hint - JNA indeed wasn't installed. However, now that Cassandra is actually using it, there doesn't seem to be any change in terms of speed - still 7 seconds with pycassa.

On Thu, Apr 19, 2012 at 12:14 AM, Paolo Bernardi berna...@gmail.com wrote:

 Look into your Cassandra's logs to see if JNA is really enabled (it really should be, by default), and more importantly if JNA is loaded correctly. You might find some surprising message over there: if this is the case, just install JNA with your distro's package manager and, if it still doesn't work, copy the JNA jar into Cassandra's lib directory (been there, done that).

 Paolo
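 For a concrete check, something like this against the startup log (the log path varies by install, and the exact wording may differ between versions, so treat these as approximations):

     grep -i jna /var/log/cassandra/system.log
     # a healthy startup logs something like:  JNA mlockall successful
     # a broken setup logs something like:     JNA not found. Native methods will be disabled.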



Re: Cassandra read optimization

2012-04-19 Thread Dan Feldman
We'll try doing multithreaded requests today or tomorrow.

As for tuning down the number of supercolumns per slice, I tried doing that, but I've noticed that the time was decreasing linearly with the length of the slice. So, grabbing 1000 per slice would take 1/5 as long as 5000, but I'll have to make 5 times as many requests to the database - that was the reason for choosing a pretty much arbitrary, relatively large number like 5000.

Finally, forgive my newbiness, but what do you mean by "use Hadoop to export"? Since the whole point of using these clients is to move Cassandra data to HDFS for future analysis, if there is a more direct way of doing that, it would be perfect!

Dan F.

On Thu, Apr 19, 2012 at 4:24 AM, aaron morton aa...@thelastpickle.com wrote:

 Here's a test I did a while ago about creating column objects in python
 http://www.mail-archive.com/user@cassandra.apache.org/msg06729.html

 As Tyler said, the best approach is to limit the size of the slices.

 If you are trying to load 125K super columns with 25 columns each, you are asking for roughly 3M columns. That is going to take a while to iterate over using a single client. If possible, break the task up for multiple threads / processes.
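 A minimal sketch of that split using pycassa and multiprocessing (the keyspace, column family, row key, and super column name boundaries are all made up; each worker needs its own connection pool, and adjacent ranges overlap by one super column because slice bounds are inclusive):

     from multiprocessing import Pool
     from pycassa.pool import ConnectionPool
     from pycassa.columnfamily import ColumnFamily

     # Hypothetical super column name boundaries; '' means unbounded.
     BOUNDS = ['', 'sc_050000', 'sc_100000', '']

     def dump_range(bounds):
         start, finish = bounds
         pool = ConnectionPool('MyKeyspace', ['localhost:9160'], timeout=30)
         cf = ColumnFamily(pool, 'MySuperCF')
         n = 0
         for name, cols in cf.xget('row_key', column_start=start,
                                   column_finish=finish, buffer_size=1000):
             n += 1  # write to a local file bound for HDFS here
         pool.dispose()
         return n

     if __name__ == '__main__':
         ranges = zip(BOUNDS[:-1], BOUNDS[1:])
         print(sum(Pool(3).map(dump_range, list(ranges))))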

 Is that a task you need to repeat? Can you use Hadoop to export?

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com


Cassandra read optimization

2012-04-18 Thread Dan Feldman
Hi all,

I'm trying to optimize moving data from Cassandra to HDFS using either Ruby or Python client. Right now, I'm playing around on my staging server, an 8 GB single node machine. My data in Cassandra (1.0.8) consist of 2 rows (for now) with ~150k super columns each (I know, I know - super columns are bad). Every super column has ~25 columns totaling ~800 bytes per super column.

I should also mention that currently the database is static - there are no writes/updates, only reads.

Anyways, in my python/ruby scripts, I'm taking slices of 5000 supercolumns long from a single row. It takes 13 seconds with ruby and 8 seconds with pycassa to get a single slice. Or, in other words, it's currently reading at speeds of less than 500 kB per second. The speed seems to be linear with the length of a slice (i.e. 6 seconds for 2500 scs for ruby). If I run nodetool cfstats while my script is running, it tells me that my read latency on the column family is ~300ms.

I assume that this is not normal and thus was wondering what parameters I could tweak to improve the performance.
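A minimal pycassa sketch of the measurement described above, for anyone who wants to reproduce it (the keyspace, column family, and row key names are made up; the ~0.8 kB per super column figure comes from the numbers above):

    import time
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool('MyKeyspace', ['localhost:9160'], timeout=30)
    cf = ColumnFamily(pool, 'MySuperCF')

    t0 = time.time()
    scs = cf.get('row_key', column_count=5000)  # one 5000-super-column slice
    elapsed = time.time() - t0
    print('%d super columns in %.1f s (~%.0f kB/s)'
          % (len(scs), elapsed, len(scs) * 0.8 / elapsed))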

Thanks,
Dan F.