Re: Install Cassandra on EC2

2011-08-03 Thread Dave Viner
Hi Eldad,

Check out http://wiki.apache.org/cassandra/CloudConfig
There are a few ways listed there including a step-by-step guide.

Dave Viner


On Wed, Aug 3, 2011 at 7:49 AM, Eldad Yamin elda...@gmail.com wrote:

 Thanks!
 But I prefer to learn how to install it first - if you have any good
 references (I didn't find any, not even a general installation guide for an
 EC2/regular machine).
 I'm also going to try and install Solandra, I hope that Whirr will support
 it in the near future.

 On Wed, Aug 3, 2011 at 5:43 PM, John Conwell j...@iamjohn.me wrote:

 One thing you might want to look at is the Apache Whirr project (which is
 awesome by the way!).  It automagically handles spinning up a cluster of
 resources on EC2 (or rackspace for that matter), installing and configuring
 cassandra, and starting it.

 One thing to be aware of if you go this route: by default in the yaml
 file, all data is written under the /var folder.  But on a server started by
 Whirr, that folder only has something like 4 GB.  Most of the hard disk
 space is under the /mnt folder.  So you'll either need to change which
 folders are mounted on which drives (not sure if you can or not...I'm sure
 you could), or change the yaml file to point at the /mnt folder.
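A minimal sketch of that yaml change, assuming the 0.7-era cassandra.yaml keys and illustrative paths under /mnt (verify the key names against your Cassandra version):

```yaml
# Point Cassandra's storage at the large ephemeral disk EC2 mounts on /mnt.
data_file_directories:
    - /mnt/cassandra/data
commitlog_directory: /mnt/cassandra/commitlog
saved_caches_directory: /mnt/cassandra/saved_caches
```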


 On Wed, Aug 3, 2011 at 6:28 AM, Eldad Yamin elda...@gmail.com wrote:

 Hi,
 Is there any manual or important notes I should know before I try to
 install Cassandra on EC2?

 Thanks!




 --

 Thanks,
 John C





Re: LB scenario

2011-04-05 Thread Dave Viner
AJ,

One issue that I found in using load balancer in front of cassandra nodes is
that a single node might become bogged down by compaction, or other actions
unrelated to the client.  If the load balancer does not pick this up in
time, it might route client requests to the node that is temporarily
overloaded.

In practice, I've found it better for the client to have a pool of
connections, and then retry as needed to distinct nodes rather than use a
load balancer.
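In code, the pool-plus-retry idea looks roughly like this (a minimal sketch with a hypothetical `send` callable; real clients such as Hector or pycassa provide connection pooling and failover out of the box):

```python
import itertools

def request_with_retry(hosts, send, retries=3):
    """Try a request against up to `retries` distinct nodes.

    `hosts` is the pool of Cassandra nodes and `send` is any callable
    that performs the request against one host, raising on failure.
    """
    pool = itertools.cycle(hosts)
    last_error = None
    for _ in range(retries):
        host = next(pool)
        try:
            return send(host)
        except Exception as exc:  # node bogged down (e.g. compacting): try the next one
            last_error = exc
    raise last_error
```

This sidesteps the load balancer entirely: the retry decision is made by the client that actually observed the failure.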

HTH
Dave Viner


On Tue, Apr 5, 2011 at 9:51 AM, A J s5a...@gmail.com wrote:

 Can someone comment on this ? Or is the question too vague ?

 Thanks.

 On Wed, Mar 30, 2011 at 3:58 PM, A J s5a...@gmail.com wrote:
  Does the following load balancing scenario look reasonable with Cassandra?
  I will not be having any app servers.
 
  http://dl.dropbox.com/u/7258508/2011-03-30_1542.png
 
  Thanks.
 



Re: what kind of bug?

2011-03-23 Thread Dave Viner
I saw this once when my servers ran out of file descriptors.  This caused
totally weird problems.

Make sure all nodes in the cluster are listening on the gossip port (7000 by
default).

Also check out
http://www.datastax.com/docs/0.7/troubleshooting/index#view-of-ring-differs-between-some-nodes
or
http://www.datastax.com/docs/0.6/troubleshooting/index#view-of-ring-differs-between-some-nodes
depending on your version.
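The listening-port check can be scripted across the cluster; a small sketch (the node address is a placeholder, and 7000 is assumed to be the default storage_port):

```python
import socket

def gossip_reachable(host, port=7000, timeout=1.0):
    """Return True if `host` accepts TCP connections on the gossip port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder node list -- substitute your cluster's addresses.
for node in ["127.0.0.1"]:
    print(node, "listening" if gossip_reachable(node) else "NOT listening")
```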

On Wed, Mar 23, 2011 at 11:39 AM, Aaron Morton aa...@thelastpickle.com wrote:

 First thing is check the logs on host 1. Check the view of the ring from
 all other nodes in the cluster, do they think nodes 2 and 3 are also down?
 Then confirm all nodes have the same config for listen port and all nodes
 can telnet to the listen port for the other nodes.

 I'm guessing some inserts fail because you are working at
 Quorum and your replication factor is less than 5.
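The arithmetic behind that guess: QUORUM requires floor(RF/2) + 1 live replicas, so when fewer replicas than that are up for a key, requests at Quorum fail. A quick sketch:

```python
def quorum(replication_factor):
    """Replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1

# With RF=3, QUORUM is 2, so one Down replica is tolerated;
# with RF=2, QUORUM is also 2, so a single Down replica fails the request.
for rf in (1, 2, 3, 5):
    print("RF", rf, "-> quorum", quorum(rf))
```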

 Aaron
 On 23/03/2011, at 11:31 PM, pob peterob...@gmail.com wrote:

  Hello,
 
  what kind of bug is it?
 
 
  If I do nodetool host1 ring, the output is:
 
  Address Status State   LoadOwnsToken
 
  141784319550391026443072753096570088105
  1.174  Up Normal  4.14 GB 16.67%  0
  1.173  Down   Normal  4.07 GB 16.67%
  28356863910078205288614550619314017621
  1.172  Down   Normal  4.1 GB  16.67%
  56713727820156410577229101238628035242
  1.179  Up Normal  4.05 GB 16.67%
  85070591730234615865843651857942052863
  1.175  Up Normal  4.13 GB 16.67%
  113427455640312821154458202477256070484
  1.177  Up Normal  4.12 GB 16.67%
  141784319550391026443072753096570088105
 
 
  but if I do nodetool host3 ring, the output is:
 
  Address Status State   LoadOwnsToken
 
  141784319550391026443072753096570088105
  1.174  Up Normal  4.14 GB 16.67%  0
  1.173  Up   Normal  4.07 GB 16.67%
  28356863910078205288614550619314017621
  1.172  Up   Normal  4.1 GB  16.67%
  56713727820156410577229101238628035242
  1.179  Up Normal  4.05 GB 16.67%
  85070591730234615865843651857942052863
  1.175  Up Normal  4.13 GB 16.67%
  113427455640312821154458202477256070484
  1.177  Up Normal  4.12 GB 16.67%
  141784319550391026443072753096570088105
 
 
  Some nodes see other nodes as Down, and it's impossible to do inserts
 correctly. Cassandra is running on the nodes that are marked Down.
 
 
  Any ideas? Thanks
 
 
  Best,
  Peter
 
 



Re: EC2 - 2 regions

2011-03-18 Thread Dave Viner
Hi AJ,

I'd suggest getting to a multi-region cluster step-by-step.  First, get 2
nodes running in the same availability zone.  Make sure that works properly.
 Second, add a node in a separate availability zone, but in the same region.
 Make sure that's working properly.  Third, add a node that's in a separate
region.

Taking it step-by-step will ensure that any issues are specific to the
region-to-region communication, rather than intra-zone connectivity or
cassandra cluster configuration.

Dave Viner


On Fri, Mar 18, 2011 at 8:34 AM, A J s5a...@gmail.com wrote:

 Hello,

 I am trying to setup a cassandra cluster across regions.
 For testing I am keeping it simple and just having one node in US-EAST
 (say ec2-1-2-3-4.compute-1.amazonaws.com) and one node in US-WEST (say
 ec2-2-2-3-4.us-west-1.compute.amazonaws.com).
 Using Cassandra 0.7.4


 The one in east region is the seed node and has the values as:
 auto_bootstrap: false
 seeds: ec2-1-2-3-4.compute-1.amazonaws.com
 listen_address: ec2-1-2-3-4.compute-1.amazonaws.com
 rpc_address: 0.0.0.0

 The one in west region is non seed and has the values as:
 auto_bootstrap: true
 seeds: ec2-1-2-3-4.compute-1.amazonaws.com
 listen_address: ec2-2-2-3-4.us-west-1.compute.amazonaws.com
 rpc_address: 0.0.0.0

 I first fire the seed node (east region instance) and it comes up
 without issues.
 When I fire the non-seed node (west region instance) it fails after
 sometime with the error:

 DEBUG 15:09:08,844 Created HHOM instance, registered MBean.
  INFO 15:09:08,844 Joining: getting load information
  INFO 15:09:08,845 Sleeping 9 ms to wait for load information...
 DEBUG 15:09:09,822 attempting to connect to
 ec2-1-2-3-4.compute-1.amazonaws.com/1.2.3.4
 DEBUG 15:09:10,825 Disseminating load info ...
 DEBUG 15:10:10,826 Disseminating load info ...
 DEBUG 15:10:38,845 ... got load info
  INFO 15:10:38,845 Joining: getting bootstrap token
 ERROR 15:10:38,847 Exception encountered during startup.
 java.lang.RuntimeException: No other nodes seen!  Unable to bootstrap
        at org.apache.cassandra.dht.BootStrapper.getBootstrapSource(BootStrapper.java:164)
        at org.apache.cassandra.dht.BootStrapper.getBalancedToken(BootStrapper.java:146)
        at org.apache.cassandra.dht.BootStrapper.getBootstrapToken(BootStrapper.java:141)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:450)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:404)
        at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:192)
        at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)


 The seed node seems to somewhat acknowledge the non-seed node:
 attempting to connect to /2.2.3.4
 attempting to connect to /10.170.190.31

 Can you suggest how can I fix it (I did see a few threads on similar
 issue but did not really follow the chain)

 Thanks, AJ



Re: EC2 - 2 regions

2011-03-18 Thread Dave Viner
From the us-west instance, are you able to connect to the us-east instance
using telnet on port 7000 and 9160?

If not, then you need to open those ports for communication (via your
Security Group)

Dave Viner

On Fri, Mar 18, 2011 at 10:20 AM, A J s5a...@gmail.com wrote:

 Thats exactly what I am doing.

 I was able to do the first two scenarios without any issues (i.e. 2
 nodes in same availability zone. Followed by an additional node in a
 different zone but same region)

 I am stuck at the third scenario of separate regions.

  (I did read the "Cassandra nodes on EC2 in two different regions not
  communicating" thread but it did not seem to end with a resolution)


 On Fri, Mar 18, 2011 at 1:15 PM, Dave Viner davevi...@gmail.com wrote:
  Hi AJ,
  I'd suggest getting to a multi-region cluster step-by-step.  First, get 2
  nodes running in the same availability zone.  Make sure that works
 properly.
   Second, add a node in a separate availability zone, but in the same
 region.
   Make sure that's working properly.  Third, add a node that's in a
 separate
  region.
  Taking it step-by-step will ensure that any issues are specific to the
  region-to-region communication, rather than intra-zone connectivity or
  cassandra cluster configuration.
  Dave Viner
 
  On Fri, Mar 18, 2011 at 8:34 AM, A J s5a...@gmail.com wrote:
 
  Hello,
 
  I am trying to setup a cassandra cluster across regions.
  For testing I am keeping it simple and just having one node in US-EAST
  (say ec2-1-2-3-4.compute-1.amazonaws.com) and one node in US-WEST (say
  ec2-2-2-3-4.us-west-1.compute.amazonaws.com).
  Using Cassandra 0.7.4
 
 
  The one in east region is the seed node and has the values as:
  auto_bootstrap: false
  seeds: ec2-1-2-3-4.compute-1.amazonaws.com
  listen_address: ec2-1-2-3-4.compute-1.amazonaws.com
  rpc_address: 0.0.0.0
 
  The one in west region is non seed and has the values as:
  auto_bootstrap: true
  seeds: ec2-1-2-3-4.compute-1.amazonaws.com
  listen_address: ec2-2-2-3-4.us-west-1.compute.amazonaws.com
  rpc_address: 0.0.0.0
 
  I first fire the seed node (east region instance) and it comes up
  without issues.
  When I fire the non-seed node (west region instance) it fails after
  sometime with the error:
 
  DEBUG 15:09:08,844 Created HHOM instance, registered MBean.
   INFO 15:09:08,844 Joining: getting load information
   INFO 15:09:08,845 Sleeping 9 ms to wait for load information...
  DEBUG 15:09:09,822 attempting to connect to
  ec2-1-2-3-4.compute-1.amazonaws.com/1.2.3.4
  DEBUG 15:09:10,825 Disseminating load info ...
  DEBUG 15:10:10,826 Disseminating load info ...
  DEBUG 15:10:38,845 ... got load info
   INFO 15:10:38,845 Joining: getting bootstrap token
  ERROR 15:10:38,847 Exception encountered during startup.
  java.lang.RuntimeException: No other nodes seen!  Unable to bootstrap
         at org.apache.cassandra.dht.BootStrapper.getBootstrapSource(BootStrapper.java:164)
         at org.apache.cassandra.dht.BootStrapper.getBalancedToken(BootStrapper.java:146)
         at org.apache.cassandra.dht.BootStrapper.getBootstrapToken(BootStrapper.java:141)
         at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:450)
         at org.apache.cassandra.service.StorageService.initServer(StorageService.java:404)
         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:192)
         at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
         at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
 
 
  The seed node seems to somewhat acknowledge the non-seed node:
  attempting to connect to /2.2.3.4
  attempting to connect to /10.170.190.31
 
  Can you suggest how can I fix it (I did see a few threads on similar
  issue but did not really follow the chain)
 
  Thanks, AJ
 
 



Re: cassandra as user-profile data store

2011-03-01 Thread Dave Viner
Hi Dave,

Glad to hear others are using it in this fashion!

Are you using Tyler's suggested strategy for user-profile data - one CF that
stores the timeline, with rows of user-ids, and TimeUUID columns for each
data-collection-time.  Then some post-processing with Hadoop over the
timelines for each user to build a Profile?

Are you on 0.7 or 0.6.x?

Dave Viner


On Tue, Mar 1, 2011 at 1:31 AM, Dave Gardner dave.gard...@visualdna.com wrote:

 Dave

 Tyler's answer already covers CFs etc..

 We are using Cassandra to store user profile data for exactly the sort of
 use case you describe. We don't yet store _all_ the data in Cassandra;
 currently we are focusing on the stuff we need available for real-time
 access. We use Hadoop to analyse the profiles from within Cassandra.

 Dave


 On 23 February 2011 23:21, Dave Viner davevi...@gmail.com wrote:

 Hi all,

 I'm wondering if anyone has used cassandra as a datastore for a
 user-profile service.  I'm thinking of applications like behavioral
  targeting, where there are lots and lots of users (10s to 100s of millions),
  and lots and lots of data about them intermixed in, say, weblogs (probably TBs
 worth).  The idea would be to use Cassandra as a datastore for distributed
 parallel processing of the TBs of files (say on hadoop).  Then the resulting
 user-profiles would be query-able quickly.

 Anyone know of that sort of application of Cassandra?  I'm trying to
 puzzle out just what the column family might look like.  Seems like a mix of
 time-oriented information (user x visits site y at time z), location
 information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
 120.10923), and derived information (because user x visited site y 15 times
 within a 10 day window, user x must be interested in buying a car).

 I don't have specifics as yet... just some general thoughts.  But this
 feels like a Cassandra type problem.  (User profile can have lots of columns
 per user, but the exact columns might differ from user to user... very
 scalable, etc)
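As a toy model of that layout (the row keys, column names, and events below are illustrative, modeled on the timeline idea in this thread rather than a fixed schema; a real cluster would use a client like pycassa against a column family with TimeUUID comparators):

```python
import uuid
from collections import defaultdict

# One wide row per user: {row key -> {TimeUUID column -> event value}}.
# Version-1 UUIDs embed a timestamp, so sorting column names by that
# timestamp recovers the per-user timeline cheaply.
timeline_cf = defaultdict(dict)

def record_event(user_id, event):
    timeline_cf[user_id][uuid.uuid1()] = event

def user_timeline(user_id):
    """All events for one user, ordered by the UUID's embedded time."""
    row = timeline_cf[user_id]
    return [row[col] for col in sorted(row, key=lambda u: u.time)]

record_event("user-x", "visit site-y")
record_event("user-x", "visit site-y from ip 1.2.3.4 (geo 31.20309,120.10923)")
```

Derived facts ("user x must be interested in buying a car") would then be computed offline, e.g. by a Hadoop job scanning these rows.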

 Thanks
 Dave Viner





Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-24 Thread Dave Viner
Another possibility is this:

why not setup 2 nodes in 1 region in 1 az, and get that to work.
Then, open a third node in the same region, but different AZ, and get that
to work.
Then, once you have that working, open a fourth node in a different region
and get that to work.

Seems like taking a piece-meal approach would be beneficial here.

Dave Viner


On Thu, Feb 24, 2011 at 6:11 AM, Daniel van Ham Colchete 
daniel.colch...@gmail.com wrote:

 Himanshi,

 my bad, try this for iptables:

 # SNAT outgoing connections
 iptables -t nat -A POSTROUTING -p tcp --dport 7000 -d 175.41.143.192 -j
 SNAT --to-source INTERNALIP

 As for tcpdump the argument for the -i option is the interface name (eth0,
 cassth0, etc...), and not the IP. So, it should be
 tcpdump -i cassth0 -n port 7000
 or
 tcpdump -i eth0 -n port 7000

 I'm assuming your main network card is eth0, which should be the case.

 Does it work?

 Best,
 Daniel


 On Thu, Feb 24, 2011 at 9:27 AM, Himanshi Sharma 
 himanshi.sha...@tcs.comwrote:


 Thanks Daniel.

 But the SNAT command is not working, and when I try tcpdump it gives:

 [root@ip-10-136-75-201 ~]# tcpdump -i 50.18.60.117 -n port 7000
 tcpdump: Invalid adapter index

 Not able to figure out what this is ??

 Thanks,
 Himanshi



  From: Daniel van Ham Colchete daniel.colch...@gmail.com
  To: user@cassandra.apache.org
  Date: 02/24/2011 04:27 PM
  Subject: Re: Cassandra nodes on EC2 in two different regions not communicating
  --



 Himanshi,

 you could try adding your public IP address to an internal interface and
 DNAT the packets to it. This shouldn't give you any problems with your
 normal traffic. Tell Cassandra to listen on the public IPs and it should
 work.

 Linux commands would be:

 # Create an internal interface using bridge-utils
 brctl addbr cassth0

 # add the ip
 ip addr add dev cassth0 50.18.60.117/32

 # DNAT incoming connections
 iptables -t nat -A PREROUTING -p tcp --dport 7000 -d INTERNALIP -j DNAT
 --to-destination 50.18.60.117

 # SNAT outgoing connections
 iptables -t nat -A OUTPUT -p tcp --dport 7000 -d 175.41.143.192 -j SNAT
 --to-source INTERNALIP

 This should work since Amazon will re-SNAT your outgoing packets to your
 public IP again, so the other Cassandra instance will see your public IP as
 your source address.

 I didn't test this setup here but it should work unless I forgot some
 small detail. If you need to troubleshoot use the command tcpdump -i
 INTERFACE -n port 7000 where INTERFACE should be your public interface or
 your cassth0.

 Please let me know if it worked.

 Best regards,
 Daniel Colchete

 On Thu, Feb 24, 2011 at 4:04 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:
 giving the private IP to rpc_address gives the same exception,
 and keeping it blank while providing the public IP to listen_address also
 fails. I tried keeping both blank and did telnet on 7000, and I get the following o/p:

 [root@ip-10-166-223-150 bin]# telnet 122.248.193.37 7000
 Trying 122.248.193.37...
 Connected to 122.248.193.37.
 Escape character is '^]'.

 Similarly from another machine

 [root@ip-10-136-75-201 bin]# telnet 184.72.22.87 7000
 Trying 184.72.22.87...
 Connected to 184.72.22.87.
 Escape character is '^]'.



 -Dave Viner wrote: -
 To: user@cassandra.apache.org
 From: Dave Viner davevi...@gmail.com
 Date: 02/24/2011 11:59AM
 cc: Himanshi Sharma himanshi.sha...@tcs.com

 Subject: Re: Cassandra nodes on EC2 in two different regions not
 communicating

 Try using the private ipv4 address in the rpc_address field, and the
 public ipv4 (NOT the elastic ip) in the listen_address.

 If that fails, go back to rpc_address empty, and start up cassandra.

 Then from the other node, please telnet to port 7000 on the first node.
  And show the output of that session in your reply.

 I haven't actually constructed a cross-region cluster nor have I used
 v0.7, but this really sounds like it should be easy.

 On Wed, Feb 23, 2011 at 10:22 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:
 Hi Dave,

 I tried with the public IPs. If I mention the public IP in the rpc address
 field, Cassandra gives the same exception, but if I leave it blank then
 Cassandra runs; but again the nodetool command with the ring option doesn't
 show the node in another region.

 Thanks,
 Himanshi


 -Dave Viner wrote: -
 To: user@cassandra.apache.org
 From: Dave Viner davevi...@gmail.com
 Date: 02/24/2011 10:43AM

 Subject: Re: Cassandra nodes on EC2 in two different regions not
 communicating

 That looks like it's not an issue of communicating between nodes.  It
 appears that the node can not bind to the address on the localhost that
 you're asking for.

  java.net.BindException: Cannot assign requested address  

 I think the issue is that the Elastic IP address is not actually

Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-23 Thread Dave Viner
Try using the IP address, not the dns name in the cassandra.yaml.

If you can telnet from one to the other on port 7000, and both nodes have
the other node in their config, it should work.

Dave Viner


On Wed, Feb 23, 2011 at 1:43 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:


 Ya, they do. Have specified the Public DNS in the seed field of each node in
 Cassandra.yaml... not able to figure out what the problem is ???




  From: Sasha Dolgy sdo...@gmail.com
  To: user@cassandra.apache.org
  Date: 02/23/2011 02:56 PM
  Subject: Re: Cassandra nodes on EC2 in two different regions not communicating
  --



 did you define the other host in the cassandra.yaml ?  on both servers 
 they need to know about each other

 On Wed, Feb 23, 2011 at 10:16 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Thanks Dave but I am able to telnet to other instances on port 7000
 and when i run  ./nodetool --host ec2-50-18-60-117.us-west-1.compute.amazonaws.com ring...
 I can see only one node.

 Do we need to configure anything else in Cassandra.yaml or Cassandra-env.sh
 ???






   From: Dave Viner davevi...@gmail.com
   To: user@cassandra.apache.org
   Cc: Himanshi Sharma himanshi.sha...@tcs.com
   Date: 02/23/2011 11:36 AM
   Subject: Re: Cassandra nodes on EC2 in two different regions not communicating

  --



 If you login to one of the nodes, can you telnet to port 7000 on the other
 node?

 If not, then almost certainly it's a firewall/Security Group issue.

 You can find out the security groups for any node by logging in, and then
 running:

 % curl http://169.254.169.254/latest/meta-data/security-groups


 Assuming that both nodes are in the same security group, ensure that the SG
 is configured to allow other members of the SG to communicate on port 7000
 to each other.
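As an illustration with the modern AWS CLI (the 2011-era ec2-api-tools used `ec2-authorize` instead, and the group name here is hypothetical), a self-referencing rule looks like:

```shell
# Allow members of security group "cassandra" to reach each other's
# gossip port (7000) and Thrift port (9160).
aws ec2 authorize-security-group-ingress --group-name cassandra \
    --protocol tcp --port 7000 --source-group cassandra
aws ec2 authorize-security-group-ingress --group-name cassandra \
    --protocol tcp --port 9160 --source-group cassandra
```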

 HTH,
 Dave Viner


 On Tue, Feb 22, 2011 at 8:59 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Hi,

 I am new to Cassandra. I'm running Cassandra on EC2. I configured a Cassandra
 cluster on two instances in different regions.
 But when I am trying the nodetool command with ring option, I am getting
 only single node.

 How to make these two nodes communicate with each other. I have already
 opened the required ports, i.e. 7000, 8080, 9160, in the respective
 security groups. Please help me with this.

 Regards,
 Himanshi Sharma


 =-=-=
 Notice: The information contained in this e-mail
 message and/or attachments to it may contain
 confidential or privileged information. If you are

 not the intended recipient, any dissemination, use,
 review, distribution, printing or copying of the
 information contained in this e-mail message
 and/or attachments to it are strictly prohibited. If
 you have received this communication in error,

 please notify us by reply e-mail or telephone and
 immediately and permanently delete the message
 and any attachments. Thank you









 --
 Sasha Dolgy
 sasha.do...@gmail.com






cassandra as user-profile data store

2011-02-23 Thread Dave Viner
Hi all,

I'm wondering if anyone has used cassandra as a datastore for a user-profile
service.  I'm thinking of applications like behavioral targeting, where
there are lots and lots of users (10s to 100s of millions), and lots and lots of
data about them intermixed in, say, weblogs (probably TBs worth).  The idea
would be to use Cassandra as a datastore for distributed parallel processing
of the TBs of files (say on hadoop).  Then the resulting user-profiles would
be query-able quickly.

Anyone know of that sort of application of Cassandra?  I'm trying to puzzle
out just what the column family might look like.  Seems like a mix of
time-oriented information (user x visits site y at time z), location
information (user x appeared from ip x.y.z.a which is geo-location 31.20309,
120.10923), and derived information (because user x visited site y 15 times
within a 10 day window, user x must be interested in buying a car).

I don't have specifics as yet... just some general thoughts.  But this feels
like a Cassandra type problem.  (User profile can have lots of columns per
user, but the exact columns might differ from user to user... very scalable,
etc)

Thanks
Dave Viner


Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-23 Thread Dave Viner
That looks like it's not an issue of communicating between nodes.  It
appears that the node can not bind to the address on the localhost that
you're asking for.

java.net.BindException: Cannot assign requested address 

I think the issue is that the Elastic IP address is not actually an IP
address that's on the localhost.  So the daemon can not bind to that IP.
 Instead of using the EIP, use the local IP address for the rpc_address (i
think that's what you need since that is what Thrift will bind to).  Then
for the listen_address should be the ip address that is routable from the
other node.  I would first try with the actual public IP address (not the
Elastic IP).  Once you get that to work, then shutdown the cluster, change
the listen_address to the EIP, boot up and try again.
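Spelled out as a cassandra.yaml fragment following the advice above (the IPs are placeholders for the node's own addresses; verify the key names against your Cassandra version):

```yaml
# Thrift (client) interface: bind to an address local to this machine.
rpc_address: 10.170.190.31      # placeholder: this node's local/private IP
# Gossip interface: advertise an address the other region can route to.
listen_address: 184.72.22.87    # placeholder: the node's public IP, not the EIP
```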

Dave Viner


On Wed, Feb 23, 2011 at 8:54 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:


 Hey Dave,

 Sorry i forgot to mention the Non-seed configuration.

 for first node in us-west its as belowi.e its own elastic ip

 listen_address: 50.18.60.117
 rpc_address: 50.18.60.117

 and for second node in ap-southeast-1 its as belowi.e again its own
 elastic ip

 listen_address: 175.41.143.192
 rpc_address: 175.41.143.192

 Thanks,
 Himanshi





  From: Dave Viner davevi...@gmail.com
  To: user@cassandra.apache.org
  Date: 02/23/2011 11:01 PM
  Subject: Re: Cassandra nodes on EC2 in two different regions not communicating
 --



 internal EC2 ips (10.xxx.xxx.xxx) work across availability zones (e.g.,
 from us-east-1a to us-east-1b) but do not work across regions (e.g., us-east
 to us-west).  To do regions, you must use the public ip address assigned by
 amazon.

 Himanshi, when you log into 1 node, and telnet to port 7000 on the other
 node, which IP address did you use - the 10.x address or the public ip
 address?
 And what is the seed/non-seed configuration in both cassandra.yaml files?

 Dave Viner


 On Wed, Feb 23, 2011 at 8:12 AM, Frank LoVecchio fr...@isidorey.com wrote:
 The internal Amazon IP address is what you will want to use so you don't
 have to go through DNS anyways; not sure if this works from US-East to
 US-West, but it does make things quicker in between zones, e.g. us-east-1a
 to us-east-1b.


 On Wed, Feb 23, 2011 at 9:09 AM, Dave Viner davevi...@gmail.com wrote:
 Try using the IP address, not the dns name in the cassandra.yaml.

 If you can telnet from one to the other on port 7000, and both nodes have
 the other node in their config, it should work.

 Dave Viner


 On Wed, Feb 23, 2011 at 1:43 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Ya, they do. Have specified the Public DNS in the seed field of each node in
 Cassandra.yaml... not able to figure out what the problem is ???



   From: Sasha Dolgy sdo...@gmail.com
   To: user@cassandra.apache.org
   Date: 02/23/2011 02:56 PM
   Subject: Re: Cassandra nodes on EC2 in two different regions not communicating

  --



 did you define the other host in the cassandra.yaml ?  on both servers 
 they need to know about each other

 On Wed, Feb 23, 2011 at 10:16 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Thanks Dave but I am able to telnet to other instances on port 7000
 and when i run  ./nodetool --host ec2-50-18-60-117.us-west-1.compute.amazonaws.com ring...
 I can see only one node.

 Do we need to configure anything else in Cassandra.yaml or Cassandra-env.sh
 ???





   From: Dave Viner davevi...@gmail.com
   To: user@cassandra.apache.org
   Cc: Himanshi Sharma himanshi.sha...@tcs.com
   Date: 02/23/2011 11:36 AM
   Subject: Re: Cassandra nodes on EC2 in two different regions not communicating


  --



 If you login to one of the nodes, can you telnet to port 7000 on the other
 node?

 If not, then almost certainly it's a firewall/Security Group issue.

 You can find out the security groups for any node by logging in, and then
 running:

 % curl http://169.254.169.254/latest/meta-data/security-groups


 Assuming that both nodes are in the same security group, ensure that the SG
 is configured to allow other members of the SG to communicate on port 7000
 to each other.

 HTH,
 Dave Viner


 On Tue, Feb 22, 2011 at 8:59 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Hi,

 I am new to Cassandra. I'm running Cassandra on EC2. I configured a Cassandra
 cluster on two instances in different regions.
 But when I am trying the nodetool command with ring option, I am getting
 only single node.

 How to make these two nodes communicate with each other. I have already
 opened required ports. i.e 7000

Re: Cassandra nodes on EC2 in two different regions not communicating

2011-02-23 Thread Dave Viner
Try using the private ipv4 address in the rpc_address field, and the public
ipv4 (NOT the elastic ip) in the listen_address.

If that fails, go back to rpc_address empty, and start up cassandra.

Then from the other node, please telnet to port 7000 on the first node.  And
show the output of that session in your reply.

I haven't actually constructed a cross-region cluster nor have I used v0.7,
but this really sounds like it should be easy.

On Wed, Feb 23, 2011 at 10:22 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Hi Dave,

 I tried with the public IPs. If I mention the public IP in the rpc address
 field, Cassandra gives the same exception, but if I leave it blank then
 Cassandra runs; but again the nodetool command with the ring option doesn't
 show the node in another region.

 Thanks,
 Himanshi


 -Dave Viner wrote: -

 To: user@cassandra.apache.org
 From: Dave Viner davevi...@gmail.com
 Date: 02/24/2011 10:43AM

 Subject: Re: Cassandra nodes on EC2 in two different regions not
 communicating

 That looks like it's not an issue of communicating between nodes.  It
 appears that the node can not bind to the address on the localhost that
 you're asking for.

  java.net.BindException: Cannot assign requested address  

 I think the issue is that the Elastic IP address is not actually an IP
 address that's on the localhost.  So the daemon can not bind to that IP.
  Instead of using the EIP, use the local IP address for the rpc_address (i
 think that's what you need since that is what Thrift will bind to).  Then
 for the listen_address should be the ip address that is routable from the
 other node.  I would first try with the actual public IP address (not the
 Elastic IP).  Once you get that to work, then shutdown the cluster, change
 the listen_address to the EIP, boot up and try again.

 Dave Viner


 On Wed, Feb 23, 2011 at 8:54 PM, Himanshi Sharma himanshi.sha...@tcs.com wrote:


 Hey Dave,

 Sorry i forgot to mention the Non-seed configuration.

 for first node in us-west its as belowi.e its own elastic ip

 listen_address: 50.18.60.117
 rpc_address: 50.18.60.117

 and for second node in ap-southeast-1 its as belowi.e again its own
 elastic ip

 listen_address: 175.41.143.192
 rpc_address: 175.41.143.192

 Thanks,
 Himanshi





   From: Dave Viner davevi...@gmail.com
   To: user@cassandra.apache.org
   Date: 02/23/2011 11:01 PM
   Subject: Re: Cassandra nodes on EC2 in two different regions not communicating
 --



 Internal EC2 IPs (10.xxx.xxx.xxx) work across availability zones (e.g.,
 from us-east-1a to us-east-1b) but do not work across regions (e.g., us-east
 to us-west).  To span regions, you must use the public IP address assigned by
 Amazon.

 Himanshi, when you log into 1 node, and telnet to port 7000 on the other
 node, which IP address did you use - the 10.x address or the public ip
 address?
 And what is the seed/non-seed configuration in both cassandra.yaml files?

 Dave Viner


 On Wed, Feb 23, 2011 at 8:12 AM, Frank LoVecchio fr...@isidorey.com wrote:
 The internal Amazon IP address is what you will want to use so you don't
 have to go through DNS anyways; not sure if this works from US-East to
 US-West, but it does make things quicker in between zones, e.g. us-east-1a
 to us-east-1b.


 On Wed, Feb 23, 2011 at 9:09 AM, Dave Viner davevi...@gmail.com wrote:
 Try using the IP address, not the dns name in the cassandra.yaml.

 If you can telnet from one to the other on port 7000, and both nodes have
 the other node in their config, it should work.

 Dave Viner


 On Wed, Feb 23, 2011 at 1:43 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Ya, they do. I have specified the public DNS in the seed field of each node's
 cassandra.yaml... not able to figure out what the problem is.



 From: Sasha Dolgy sdo...@gmail.com
 To: user@cassandra.apache.org
 Date: 02/23/2011 02:56 PM
 Subject: Re: Cassandra nodes on EC2 in two different regions not communicating



 Did you define the other host in cassandra.yaml? On both servers:
 they need to know about each other.

 On Wed, Feb 23, 2011 at 10:16 AM, Himanshi Sharma himanshi.sha...@tcs.com wrote:

 Thanks Dave, but I am able to telnet to the other instances on port 7000,
 and when I run ./nodetool --host
 ec2-50-18-60-117.us-west-1.compute.amazonaws.com ring ... I
 can see only one node.

 Do we need to configure anything else in cassandra.yaml or
 cassandra-env.sh?





 From: Dave Viner davevi...@gmail.com
 To: user@cassandra.apache.org
 Cc: Himanshi Sharma himanshi.sha...@tcs.com
 Date: 02/23/2011 11:36 AM
 Subject: Re: Cassandra nodes on EC2 in two different

quick shout-out to the riptano/datastax folks!

2011-02-02 Thread Dave Viner
Just a quick shout-out to the riptano folks and becoming part of/forming
DataStax!

Congrats!


Re: Upgrading from 0.6 to 0.7.0

2011-01-21 Thread Dave Viner
I agree.  I am running a 0.6 cluster and would like to upgrade to 0.7.  But,
I can not simply stop my existing nodes.

I need a way to load a new cluster - either on the same machines or new
machines - with the existing data.

I think my overall preference would be to upgrade the cluster to 0.7 running
on a new port (or new set of machines), then have a tiny translation service
on the old port which did whatever translation is required from 0.6 protocol
to 0.7 protocol.

Then I would upgrade my clients once to the 0.7 protocol and also change
their connection parameters to the new 0.7 cluster.

But, I'd be open to anything ... just need a way to upgrade without having
to turn everything off, do the upgrade, then turn everything back on.  I am
not able to do that in my production environment (for business reasons).
 Docs on alternatives other than turn off, upgrade, turn on would be
fantastic.

Dave Viner


On Fri, Jan 21, 2011 at 1:01 PM, Aaron Morton aa...@thelastpickle.comwrote:

 Yup, you can use diff ports and you can give them different cluster names
 and different seed lists.

 After you upgrade the second cluster partition the data should repair
 across, either via RR or the HHs that were stored while the first partition
 was down. Easiest thing would be to run node tool repair. Then a clean up to
 remove any leftover data.

 AFAIK file formats are compatible. But drain the nodes before upgrading to
 clear the log.

 Can you test this on a non production system?

 Aaron
 (we really need to write some upgrade docs:))

 On 21/01/2011, at 10:42 PM, Dave Gardner dave.gard...@imagini.net wrote:

 What about executing writes against both clusters during the changeover?
 Interested in this topic because we're currently thinking about the same
 thing - how to upgrade to 0.7 without any interruption.

 Dave

 On 21 January 2011 09:20, Daniel Josefsson jid...@gmail.com wrote:

 No, what I'm thinking of is having two clusters (0.6 and 0.7) running on
 different ports so they can't find each other. Or isn't that configurable?

 Then, when I have the two clusters, I could upgrade all of the clients to
 run against the new cluster, and finally upgrade the rest of the Cassandra
 nodes.

 I don't know how the new cluster would cope with having new data in the
 old cluster when they are upgraded though.

 /Daniel

 2011/1/20 Aaron Morton aa...@thelastpickle.com

 I'm not sure if your suggesting running a mixed mode cluster there, but
 AFAIK the changes to the internode protocol prohibit this. The nodes will
 probable see each either via gossip, but the way the messages define their
 purpose (their verb handler) has been changed.

 Out of interest which is more painful, stopping the cluster and upgrading
 it or upgrading your client code?

 Aaron

 On 21/01/2011, at 12:35 AM, Daniel Josefsson jid...@gmail.com wrote:

 In our case our replication factor is more than half the number of nodes
 in the cluster.

 Would it be possible to do the following:

- Upgrade half of them
- Change Thrift Port and inter-server port (is this the
storage_port?)
- Start them up
- Upgrade clients one by one
- Upgrade the rest of the servers

 Or might we get some kind of data collision when still writing to the old
 cluster as the new storage is being used?

 /Daniel






Re: Cassandra automatic startup script on ubuntu

2011-01-20 Thread Dave Viner
You can also use the apt-get repository version, which installs the startup
script.  On http://wiki.apache.org/cassandra/CloudConfig, see the Cassandra
Basic Setup section.  It applies to any debian based machine, not just cloud
instances.

HTH
Dave Viner

On Thu, Jan 20, 2011 at 9:11 AM, Donal Zang zan...@ihep.ac.cn wrote:

  On 20/01/2011 17:51, Sébastien Druon wrote:

 Hello!

  I am using cassandra on a ubuntu machine and installed it from the binary
 found on the cassandra home page.
 However, I did not find any scripts to start it up at boot time.

  Where can I find this kind of script?

  Thanks a lot in advance

  Sebastien

 Hi, this is what I do; you can add the watchdog to rc.local:

 #!/bin/bash
 #
 # This script checks every $INTERVAL seconds
 # whether cassandra is working well,
 # and restarts it if necessary
 # by donal 2010-01-11
 #
 PORT=9160
 INTERVAL=2
 CASSANDRA=/opt/cassandra
 check() {
     netstat -tln | grep LISTEN | grep ":$1"
     if [ $? -ne 0 ]; then
         echo "restarting cassandra"
         $CASSANDRA/bin/stop-server
         sleep 1
         $CASSANDRA/bin/start-server
     fi
 }
 while true; do
     check $PORT
     sleep $INTERVAL
 done




Re: Do you have a site in production environment with Cassandra? What client do you use?

2011-01-15 Thread Dave Viner
Perl using the thrift interface directly.

On Sat, Jan 15, 2011 at 6:10 AM, Daniel Lundin d...@eintr.org wrote:

 python + pycassa
 scala + Hector

 On Fri, Jan 14, 2011 at 6:24 PM, Ertio Lew ertio...@gmail.com wrote:
  Hey,
 
  If you have a site in production environment or considering so, what
  is the client that you use to interact with Cassandra. I know that
  there are several clients available out there according to the
  language you use but I would love to know what clients are being used
  widely in production environments and are best to work with(support
  most required features for performance).
 
  Also preferably tell about the technology stack for your applications.
 
  Any suggestions, comments appreciated ?
 
  Thanks
  Ertio
 



anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Dave Viner
Does anyone use Cassandra to power an analytics or data warehouse
implementation?

As a concrete example, one could imagine Cassandra storing data for
something that reports on page-views on a website.  The basic notions might
be simple (url as row-key and columns as timeuuids of viewers).  But, how
would one store things like ip-geolocation to set of pages viewed?  Or
hour-of-day to pages viewed?

Also, how would one do a query like
- tell me how many page views occurred between 12/01/2010 and 12/31/2010?
- tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US?
- tell me how many page views occurred between 12/01/2010 and 12/31/2010
from the US in the 9th hour of the day (in gmt)?

Time slicing and dimension slicing seems like it might be very challenging
(especially since the windows of time would not be known in advance).

Thanks
Dave Viner


Re: anyone using Cassandra as an analytics/data warehouse?

2011-01-04 Thread Dave Viner
Hi Peter,

Thanks.  These are great ideas.  One comment tho.  I'm actually not as
worried about the logging into the system performance and more
speculating/imagining the querying out of the system.

Most traditional data warehouses have a cube or a star schema or something
similar.  I'm trying to imagine how one might use Cassandra in situations
where that sort of design has historically been applied.

But, I want to make sure I understand your suggestion A.

Is it something like this?

a Column Family with the row key being the Unix time divided by 60x60 and a
column key of... pretty much anything unique
LogCF[hour-day-in-epoch-seconds][timeuuid] = 1
where 'hour-day-in-epoch-seconds' is something like the first second of the
given hour of the day, so 01/04/2011 19:00:00 (in epoch
seconds: 1294167600); 'timeuuid' is a TimeUUID from cassandra, and '1' is
the value of the entry.

Then look at the current row every hour to actually compile the numbers,
and store the count in the same Column Family
LogCF[hour-day-in-epoch-seconds][total] = x
where 'x' is the sum of the number of timeuuid columns in the row?


Is that what you're envisioning in Option A?

Thanks
Dave Viner
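To make sure we're talking about the same scheme, here is a minimal sketch of Option A in plain Python, with a dict standing in for the column family. All the names here (log_cf, record_page_view, rollup) are illustrative, not actual pycassa/Thrift calls:

```python
import uuid

BUCKET = 60 * 60  # one row per hour, as described above

def hour_row_key(epoch_seconds):
    """Row key: the first second of the hour containing epoch_seconds."""
    return int(epoch_seconds) // BUCKET * BUCKET

# A dict standing in for LogCF: {row_key: {column_name: value}}.
log_cf = {}

def record_page_view(epoch_seconds):
    """Insert LogCF[hour][timeuuid] = 1 for one page view."""
    row = log_cf.setdefault(hour_row_key(epoch_seconds), {})
    row[uuid.uuid1()] = 1

def rollup(row_key):
    """The hourly compile step: count the timeuuid columns and store
    the total back in the same row under a 'total' column."""
    row = log_cf.setdefault(row_key, {})
    row["total"] = sum(v for k, v in row.items() if k != "total")
    return row["total"]

# 01/04/2011 19:23:45 GMT lands in the row keyed by 19:00:00 (1294167600).
assert hour_row_key(1294167600 + 23 * 60 + 45) == 1294167600
```

The country-code variant from the original question would just append the code to the row key (e.g. "1294167600:US") and write each view into both rows.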



On Tue, Jan 4, 2011 at 6:38 PM, Peter Harrison cheetah...@gmail.com wrote:

 Okay, here is two ways to handle this, both are quite different from each
 other.


 A)

 This approach does not depend on counters. You simply have a Column Family
 with the row key being the Unix time divided by 60x60 and a column key of...
 pretty much anything unique. Then have another process look at the current
 row every hour to actually compile the numbers, and store the count in the
 same Column Family. This will solve the first and third use cases, as it is
 just a matter of looking at the right rows. The second case will require a
 similar index, but one which includes a country code to be appended to the
 row key.

 The downside here is that you are storing lots of data on individual
 requests and retaining it. If you don't want the detailed data you might add
 a second process to purge the detail every hour.

 B)

 There is a counter feature added to the latest versions of Cassandra. I
 have not used them, but they should be able to be used to achieve the same
 effect without a second process cleaning up every hour. Also means it is
 more of a real time system so you can see how many requests in the hour you
 are currently in.



 Basically you have to design your approach based on the query you will be
 doing. Don't get too hung up on traditional data structures and queries as
 they have little relationship to a Cassandra approach.



 On Wed, Jan 5, 2011 at 2:34 PM, Dave Viner davevi...@gmail.com wrote:

 Does anyone use Cassandra to power an analytics or data warehouse
 implementation?

 As a concrete example, one could imagine Cassandra storing data for
 something that reports on page-views on a website.  The basic notions might
 be simple (url as row-key and columns as timeuuids of viewers).  But, how
 would one store things like ip-geolocation to set of pages viewed?  Or
 hour-of-day to pages viewed?

 Also, how would one do a query like
 - tell me how many page views occurred between 12/01/2010 and
 12/31/2010?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US?
 - tell me how many page views occurred between 12/01/2010 and 12/31/2010
 from the US in the 9th hour of the day (in gmt)?

 Time slicing and dimension slicing seems like it might be very challenging
 (especially since the windows of time would not be known in advance).

 Thanks
 Dave Viner





Re: Does Cassandra run better on Amazon EC2 or Rackspace cloud servers?

2011-01-03 Thread Dave Viner
Since it's all pay-for-use, you could build your system on both, then do
whatever stress testing you want.

The cassandra part of your app should be unchanged between different cloud
providers.

Personally, I'm using EC2 and don't have any complaints.

Dave Viner


On Mon, Jan 3, 2011 at 3:49 PM, Ryan King r...@twitter.com wrote:

 On Mon, Jan 3, 2011 at 3:04 PM, Cassy Andra cassandral...@gmail.com
 wrote:
  My company is looking to develop a software prototype based off Cassandra
 in
  the cloud. We expect to run 5-10 NoSQL servers for the prototype. I've
  read online (Jonathan Ellis was pretty vocal about this) that EC2 has
 some
  I/O issues. Is the general consensus to run Cassandra on EC2 or
 Rackspace?
  What are the pros + cons?

 I don't know about RAX cloud, but Joe Stump of SimpleGeo did some
 benchmarks of ec2 io performance:

 http://stu.mp/2009/12/disk-io-and-throughput-benchmarks-on-amazons-ec2.html

 -ryan



Re: Virtual IP / hardware load balancing for cassandra nodes

2010-12-20 Thread Dave Viner
You can put a Cassandra cluster behind a load balancer.  One thing to be
cautious of is the health check.  Just because the node is listening on port
9160 doesn't mean that it's healthy to serve requests.  It is required, but
not sufficient.

The real test is the JMX values.

Dave Viner
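A deeper health check along these lines might combine the port probe with a JMX-backed status query. This is only a sketch under the assumption that nodetool (which reads the JMX values) is installed on the node, not a vetted production script:

```shell
#!/bin/bash
# Hypothetical load-balancer health check: require both a listening Thrift
# port and a successful JMX-backed status query via nodetool.
netstat -tln | grep -q ":9160 " || exit 2              # Thrift not listening
nodetool -h localhost info > /dev/null 2>&1 || exit 2  # JMX not answering
exit 0
```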


On Mon, Dec 20, 2010 at 6:25 AM, Jonathan Colby jonathan.co...@gmail.comwrote:

 I was unable to find example or documentation on my question.  I'd like to
 know what the best way to group a cluster of cassandra nodes behind a
 virtual ip.

 For example, can cassandra nodes be placed behind a Citrix Netscaler
 hardware load balancer?

 I can't imagine it being a problem, but in doing so would you break any
 cassandra functionality?

 The goal is to have the application talk to a single virtual ip  and be
 directed to a random node in the cluster.

 I heard a little about adding the node addresses to Hector's load-balancing
 mechanism, but this doesn't seem too robust or easy to maintain.

 Thanks in advance.


Re: Cassandra Monitoring

2010-12-19 Thread Dave Viner
How does mx4j compare with the earlier jmx-to-rest bridge listed in the
operations page:

"JMX-to-REST bridge available at
http://code.google.com/p/polarrose-jmx-rest-bridge"

Thanks
Dave Viner


On Sun, Dec 19, 2010 at 7:01 AM, Ran Tavory ran...@gmail.com wrote:

 FYI, I just added an mx4j section to the bottom of this page
 http://wiki.apache.org/cassandra/Operations


 On Sun, Dec 19, 2010 at 4:30 PM, Jonathan Ellis jbel...@gmail.com wrote:

 mx4j? https://issues.apache.org/jira/browse/CASSANDRA-1068


 On Sun, Dec 19, 2010 at 8:36 AM, Peter Schuller 
 peter.schul...@infidyne.com wrote:

  How / what are you monitoring? Best practices someone?

 I recently set up monitoring using the cassandra-munin-plugins
 (https://github.com/jamesgolick/cassandra-munin-plugins). However, due
 to various little details that wasn't too fun to integrate properly
 with munin-node-configure and automated configuration management. A
 problem is also the starting of a JVM for each use of jmxquery, which
 can become a problem with many column families.

 I like your web server idea. Something persistent that can sit there
 and do the JMX acrobatics, and expose something more easily consumed
 for stuff like munin/zabbix/etc. It would be pretty nice to have that
 out of the box with Cassandra, though I expect that would be
 considered bloat. :)

 --
 / Peter Schuller




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com




 --
 /Ran




Re: Cassandra Monitoring

2010-12-19 Thread Dave Viner
Can you share the code for run_column_family_stores.sh ?

On Sun, Dec 19, 2010 at 6:14 PM, Edward Capriolo edlinuxg...@gmail.comwrote:

 On Sun, Dec 19, 2010 at 2:01 PM, Ran Tavory ran...@gmail.com wrote:
  Mx4j is in-process (same JVM); you just need to throw mx4j-tools.jar in
  the lib before you start Cassandra. jmx-to-rest runs in a separate JVM.
  Mx4j also has a nice HTML interface that lets you look into any
  running host.
 
  On Sunday, December 19, 2010, Dave Viner davevi...@gmail.com wrote:
  How does mx4j compare with the earlier jmx-to-rest bridge listed in the
 operations page:
   "JMX-to-REST bridge available at
  http://code.google.com/p/polarrose-jmx-rest-bridge"
 
  Thanks, Dave Viner
 
 
  On Sun, Dec 19, 2010 at 7:01 AM, Ran Tavory ran...@gmail.com wrote:
  FYI, I just added an mx4j section to the bottom of this page
 http://wiki.apache.org/cassandra/Operations
 
 
  On Sun, Dec 19, 2010 at 4:30 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
  mx4j? https://issues.apache.org/jira/browse/CASSANDRA-1068
 
 
 
 
  On Sun, Dec 19, 2010 at 8:36 AM, Peter Schuller 
 peter.schul...@infidyne.com wrote:
  How / what are you monitoring? Best practices someone?
 
  I recently set up monitoring using the cassandra-munin-plugins
  (https://github.com/jamesgolick/cassandra-munin-plugins). However, due
  to various little details that wasn't too fun to integrate properly
  with munin-node-configure and automated configuration management. A
  problem is also the starting of a JVM for each use of jmxquery, which
  can become a problem with many column families.
 
  I like your web server idea. Something persistent that can sit there
  and do the JMX acrobatics, and expose something more easily consumed
  for stuff like munin/zabbix/etc. It would be pretty nice to have that
  out of the box with Cassandra, though I expect that would be
  considered bloat. :)
 
  --
  / Peter Schuller
 
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of Riptano, the source for professional Cassandra support
  http://riptano.com
 
 
  --
  /Ran
 
 
 
 
 
  --
  /Ran
 

 There is a lot of overhead on your monitoring station to kick up so
 many JMX connections. There can also be nat/hostname problems for
 remote JMX.

 My solution is to execute JMX over nagios remote plugin executor (NRPE).

 command[run_column_family_stores]=/usr/lib64/nagios/plugins/run_column_family_stores.sh
 $ARG1$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$

 Maybe not as fancy as a REST-JMX bridge, but it solves most of the RMI
 issues involved in pulling stats over JMX.



Re: Facebook messaging and choice of HBase over Cassandra - what can we learn?

2010-11-21 Thread Dave Viner
I don't know the details of operation of HBase, so I can't speak on that
point.  But, I do know that Facebook hired Jonathan Grey, former CTO of
Streamy, who is a huge HBase contributor. Streamy ended in Mar 2010 -
although I'm not sure when he went to work for Facebook.

He presented on HBase at the Hadoop conference in October in NYC:
http://mpouttuclarke.wordpress.com/2010/10/18/notes-from-hadoop-world-2010-nyc/

Again, I don't know the chronology (whether he was hired before the decision
to use hbase or after).  But I know that Jonathan is a fantastically smart
(and extremely nice) guy and I'm sure he could make HBase bend to his will
at any point.

Dave Viner

On Sun, Nov 21, 2010 at 4:16 PM, Todd Lipcon t...@lipcon.org wrote:

 On Sun, Nov 21, 2010 at 2:06 PM, Edward Ribeiro 
 edward.ribe...@gmail.comwrote:


 Also I believe saying HBASE is consistent is not true. This can happen:
  write to region server -> region server acknowledges client -> write
  to WAL -> region server fails = write lost

 I wonder how facebook will reconcile that. :)


 Are you sure about that? Client writes to WAL before ack user?

  According to these posts[1][2], "if writing the record to the WAL fails
  the whole operation must be considered a failure", so it would be nonsense
  to acknowledge clients before writing the lifeline. I hope a Cloudera guy
  can explain this...


 [only jumping in because info was requested - those who know me know that I
 think Cassandra is a very interesting architecture and a better fit for many
 applications than HBase]

 You can operate the commit log in two different modes in HBase. One mode is
 deferred log flush, where the region server appends but does not sync()
 the commit log to HDFS on every write, but rather on a periodic basis (eg
 once a second). This is similar to the innodb_flush_log_at_trx_commit=2
 option in MySQL for example. This has slightly better performance obviously
 since the writer doesn't need to wait on the commit, but as you noted
 there's a window where a write may be acknowledged but then lost. This is an
 issue of *durability* moreso than consistency.

 In the other mode of operation (default in recent versions of HBase) we do
 not acknowledge a write until it has been pushed to the OS buffer on the
 entire pipeline of log replicas. Obviously this is slower, but it results in
 no lost data regardless of any machine failures. Additionally, concurrent
 readers do not see written data until these same properties have been
 satisfied. So this mode is 100% consistent and 100% durable. In practice,
 this effects latency significantly since it adds two extra round trips to
 each write, but system throughput is only reduced by 20-30% since the
 commits are pipelined (see HDFS-895 for gory details)

 I believe Cassandra has similar tuning options about whether to sync every
 commit to the log or only do so periodically.

 If you're interested in learning more, feel free to reference this
 documentation:
 http://hbase.apache.org/docs/r0.89.20100726/acid-semantics.html



 Besides that, you know that WAL is written to HDFS that takes care of
 replication and fault tolerance, right? Of course, even so, there's a
 window of inconsistency before the HLog is flushed to disk, but I don't
 think you can dismiss this as not consistent. At most, you may classify it
 as eventual consistent. :)

 [1] http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
 [2]
 http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

 E. Ribeiro





Re: Cold boot performance problems

2010-10-08 Thread Dave Viner
Has anyone found solid step-by-step docs on how to raid0 the ephemeral disks
in ec2 for use by Cassandra?
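For anyone searching the archives, here is a bare-bones sketch of the usual mdadm approach. Device names vary by instance type, and this is an assumption-laden outline rather than tested step-by-step docs:

```shell
# Stripe two ephemeral disks (device names are assumptions) into RAID0
# with mdadm, then put the Cassandra data directory on the array.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.xfs /dev/md0                 # or ext3/ext4, per preference
mkdir -p /var/lib/cassandra
mount /dev/md0 /var/lib/cassandra
```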

On Fri, Oct 8, 2010 at 12:11 PM, Jason Horman jhor...@gmail.com wrote:

 We are currently using EBS with 4 volumes striped with LVM. Wow, we
 didn't realize you could raid the ephemeral disks. I thought the
 opinion for Cassandra though was that the ephemeral disks were
 dangerous. We have lost a few machines over the past year, but
 replicas hopefully prevent real trouble.

 How about the sharding strategies? Is it worth it to investigate
 sharding out via multiple keyspaces? Would order preserving
 partitioning help group indexes better for users?

 On Fri, Oct 8, 2010 at 1:53 PM, Jonathan Ellis jbel...@gmail.com wrote:
  Two things that can help:
 
  In 0.6.5, enable the dynamic snitch with
 
  -Dcassandra.dynamic_snitch_enabled=true
  -Dcassandra.dynamic_snitch=cassandra.dynamic_snitch_enabled
 
  which if you are doing a rolling restart will let other nodes route
  around the slow node (at CL.ONE) until it's warmed up (by the read
  repairs in the background).
 
  In 0.6.6, we've added save/load of the Cassandra caches:
  https://issues.apache.org/jira/browse/CASSANDRA-1417
 
  Finally: we recommend using raid0 ephemeral disks on EC2 with L or XL
  instance sizes for better i/o performance.  (Corey Hulen has some
  numbers at http://www.coreyhulen.org/?p=326.)
 
  On Fri, Oct 8, 2010 at 12:36 PM, Jason Horman jhor...@gmail.com wrote:
  We are experiencing very slow performance on Amazon EC2 after a cold
 boot.
  10-20 tps. After the cache is primed things are much better, but it
 would be
  nice if users who aren't in cache didn't experience such slow
 performance.
  Before dumping a bunch of config I just had some general questions.
 
  We are using uuid keys, 40m of them and the random partitioner. Typical
  access pattern is reading 200-300 keys in a single web request. Are uuid
  keys going to be painful because they are so random? Should we be using less
  random keys, maybe with a shard prefix (01-80), and make sure that our
  tokens group user data together on the cluster (via the order preserving
  partitioner)
  Would the order preserving partitioner be a better option in the sense
 that
  it would group a single users data to a single set of machines (if we
 added
  a prefix to the uuid)?
  Is there any benefit to doing sharding of our own via Keyspaces. 01-80
  keyspaces to split up the data files. (we already have 80 mysql shards
 we
  are migrating from, so doing this wouldn't be terrible implementation
 wise)
  Should a goal be to get the data/index files as small as possible. Is
 there
  a size at which they become problematic? (Amazon EC2/EBS fyi)
 
  Via more servers
  Via more cassandra instances on the same server
  Via manual sharding by keyspace
  Via manual sharding by columnfamily
 
  Thanks,
  --
  -jason horman
 
 
 
 
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of Riptano, the source for professional Cassandra support
  http://riptano.com
 



 --
 -jason



Re: Advice on settings

2010-10-07 Thread Dave Viner
Also, as a note related to EC2, choose whether you want to be in multiple
availability zones.  The highest performance possible is to be in a single
AZ, as all those machines will have *very* high speed interconnects.  But,
individual AZs also can suffer outages.  You can distribute your instances
across, say, 2 AZs, and then use a RackAwareStrategy to force replication to
put at least 1 copy of the data into the other AZ.

Also, it's easiest to stay within a single Region (in EC2-speak).  This
allows you to use the internal IP addresses for Gossip and Thrift
connections - which means you do not pay inbound-outbound fees for the data
xfer.

HTH,
Dave Viner


On Thu, Oct 7, 2010 at 10:26 AM, B. Todd Burruss bburr...@real.com wrote:

 if you are updating columns quite rapidly, you will scatter the columns
 over many sstables as you update them over time.  this means that a read of
 a specific column will require looking at more sstables to find the data.
  performing a compaction (using nodetool) will merge the sstables into one
 making your reads more performant.  of course the more columns, the more
 scattering around, the more I/O.

 to your point about sharing the data around.  adding more machines is
 always a good thing to spread the load - you add RAM, CPU, and persistent
 storage to the cluster.  there probably is some point where enough machines
 creates a lot of network traffic, but 10 or 20 machines shouldn't be an
 issue.  don't worry about trying to hit a node that has the data unless your
 machines are connected across slow network links.


 On 10/07/2010 12:48 AM, Dave Gardner wrote:

 Hi all

 We're rolling out a Cassandra cluster on EC2 and I've got a couple if
 questions about settings. I'm interested to hear what other people
 have experienced with different values and generally seek advice.

 *gcgraceseconds*

 Currently we configure one setting for all CFs. We experimented with
 this a bit during testing, including changing from the default (10
 days) to 3 hours. Our use case involves lots of rewriting the columns
 for any given keys. We probably rewrite around 5 million per day.

 We are thinking of setting this to around 3 days for production so
 that we don't have old copies of data hanging round. Is there anything
 obviously wrong with this? Out of curiosity, would there be any
 performance issues if we had this set to 30 days? My understanding is
 that it would only affect the amount of disk space used.

 However Ben Black suggests here that the cleanup will actually only
 impact data deleted through the API:

 http://comments.gmane.org/gmane.comp.db.cassandra.user/4437

 In this case, I guess that we need not worry too much about the
 setting since we are actually updating, never deleting. Is this the
 case?


 *Replication factor*

 Our use case is many more writes than reads, but when we do have reads
 they're random (we're not currently using hadoop to read entire CFs).
 I'm wondering what sort of level of RF to have for a cluster. We
 currently have 12 nodes and RF=4.

 To improve read performance I'm thinking of upping the number of nodes
 and keeping RF at 4. My understanding is that this means we're sharing
 the data around more. However it also means a client read to a random
 node has less chance of actually connecting to one of the nodes with
 the data on. I'm assuming this is fine. What sort of RFs do others
 use? With a huge cluster like the recently mentioned 400 node US govt
 cluster, what sort of RF is sane?

 On a similar note (read perf), I'm guessing that reading at a weak
 consistency level will bring gains. Gleaned from this slide amongst
 other places:


 http://www.slideshare.net/mobile/benjaminblack/introduction-to-cassandra-replication-and-consistency#13

 Is this true, or will read repair still hammer disks in all the
 machines with the data on? Again I guess it's better to have a low RF so
 there are fewer copies of the data to inspect when doing read repair.
 Will this result in better read performance?

 Thanks

 dave
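On the consistency-level question above: Cassandra's quorum size is floor(RF/2) + 1, so the replica count touched by a QUORUM read grows with RF while CL.ONE stays constant. A quick sketch of the arithmetic:

```python
def quorum(rf):
    """Number of replicas that must respond for a QUORUM read or write."""
    return rf // 2 + 1

# With RF=4 (as in the cluster described above), QUORUM touches 3 replicas
# while CL.ONE touches only 1 -- hence the read gain at weak consistency.
for rf in (1, 2, 3, 4, 5):
    print(rf, quorum(rf))
```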







Re: Dazed and confused with Cassandra on EC2 ...

2010-09-17 Thread Dave Viner
Hi Jedd,

I'm using Cassandra on EC2 as well - so I'm quite interested.

Just to clarify your post - it sounds like you have 4 questions/issue:

1. Writes have slowed down significantly.  What's the logical explanation?
And what is the logical solution/options to solve it?

2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and the 2
new ones have 40 GB.  What's the recommended practice for rebalancing (i.e.,
when should you do it), what's the actual procedure, and what's the expected
impact of it?

3. Cassandra nodes disappear.  (I'm not quite clear what this means.)

4. You took a machine offline without decommissioning it from the cluster.
 Now the machine is gone, but the other nodes (in Gossip logs) report that
they are still looking for it.  How do you stop nodes from looking for a
removed node?

I'm not trying to put words in your mouth - but I want to make sure that I
understand what you're asking about (because I have similar ec2-related
thoughts).  Let me know if this is an accurate summary.

Dave Viner


On Fri, Sep 17, 2010 at 7:41 AM, Jedd Rashbrooke 
jedd.rashbro...@imagini.net wrote:

  Howdi,

  I've just landed in an experiment to get Cassandra going, and
  fed by PHP via Thrift via Hadoop, all running on EC2.  I've been
  lurking a bit on the list for a couple of weeks, mostly reading any
  threads with the word 'performance' in them.  Few people have
  anything polite to say about EC2, but I want to just throw out
  some observations and get some feedback on whether what
  I'm seeing is even approaching any kind of normal.

  My background is mostly *nix and networking, with half-way
  decent understanding of DB's -- but Cassandra, Hadoop, Thrift
  and EC2 are all fairly new to me.

  We're using a four-node decently-specced (m2.2xlarge, if you're
  EC2-aware) cluster - 32GB, 4-core, if you're not :)  I'm using
  Ubuntu with the Deb packages for Cassandra and Hadoop, and
  some fairly conservative tweaks to things like JVM memory
  (bumping them up to 4GB, then 16GB).

  One of our insert jobs - a mapper only process - was running
  pretty fast a few days ago.  Somewhere around a million lines
  of input, split into a dozen files, inserting via a Hadoop job in
  about a half hour.  Happy times.  This was when the cluster
  was modestly sized - 20-50GB.  It's now about 200GB, and
  performance has dropped by an order of magnitude - perhaps
  5-6 hours to do the same amount of work, using the same
  codebase and the same input data.

  I've read that reads slow down as the DB grows, but had an
  expectation that writes would be consistently snappy.  How
  surprising is this performance drop given the DB growth?

  My 4-node cluster started off as a 2-node - and now nodetool
  ring suggests the two original nodes are 200GB each, and
  the newer two are 40GB.  Is this normal?  Would a rebalance
  likely improve performance substantially?  My feeling is that
  it would be expensive to perform.

  EC2 seems to get a bad rap, and we're feeling quite a bit of
  pain, which is sad given the (on paper) spec of the machines,
  and the cost - over US$3k/month for the cluster.  I've split
  Cassandra commitlog, Cassandra data, hadoop(hdfs) and
  tmp onto separate 'spindles' - observations so far suggest
  late '90's disk IO speed (15MB max sustained writes, one
  machine, one disk to another), and consistently inconsistent
  performance (identical machine next to it running the same
  task at the same time was getting 28MB) over several hours.

  Cassandra nodes seems to disappear too easily - even
  with just one core (out of four) maxed out with a jsvc task,
  minimal disk or network activity, the machine feels very
  sluggish.  Tailing the cassandra logs hints that it's doing
  hinted handoffs and occasionally compaction tasks.  I've
  never seen this kind of behaviour - and suspect this is
  more a feature of EC2.

  Gossip now seems to be pining the loss of an older machine
  (that I stupidly took offline briefly - EC2 gave it a new IP address
  when it came back).  There's nothing in the storage-conf to
  refer to the old address, all 4 Cassandra daemons have been
  re-started several times since, but gossip occasionally (a day
  later) says that it is looking for it - and more worrying that
  it is 'now part of the cluster'.  I'm unsure if this is just an
  irritation or part of the underlying problem.

  What I'm going to do next is to try importing some data into
  a local machine - it's just time-consuming to pull in our S3
  data - and see if I can fake up to around the same capacity
  and watch for performance degradation.

  I'm also toying with the idea of going from 4 to 8 nodes,
  but I'm clueless on whether / how much this would help.

  As I say, though, I'm keen on anyone else's observations on
  my observations - I'm painfully aware that I'm juggling a lot
  of unknown factors at the moment.

  cheers,
  Jedd.



Re: Monitoring with Cacti

2010-09-12 Thread Dave Viner
I haven't tried cacti, but I'm using CloudKick as an external service for
monitoring Cassandra.  It's super easy to get setup.  Happy to share my
setup if that'd help.

It doesn't currently monitor JMX information, but it does offer some basic
checks like thread pool and column family stats -
https://support.cloudkick.com/Cassandra_Checks.

Dave Viner


On Fri, Sep 10, 2010 at 8:31 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Fri, Sep 10, 2010 at 7:29 PM, aaron morton aa...@thelastpickle.com
 wrote:
  Am going through the rather painful process of trying to monitor
 cassandra using Cacti (it's what we use at work). At the moment it feels
 like a losing battle :)
 
  Does anyone know of some cacti resources for monitoring the JVM or
 Cassandra metrics other than...
 
  mysql-cacti-templates
  http://code.google.com/p/mysql-cacti-templates/
  - provides templates and data sources that require ssh and can monitor
 JVM heap and a few things.
 
  Cassandra-cacti-m6
  http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp
  Coded for version 0.6* , have made some changes to stop it looking for
 stats that no longer exist. Missing some metrics I think but it's probably
 the best bet so far. If I get it working I'll contribute it back to them.
 Most of the problems were probably down to how much effort it takes to
 setup cacti.
 
  jmxterm
  http://www.cyclopsgroup.org/projects/jmxterm/
  Allows for command line access to JMX. I started down the path of writing
 a cacti data source to use this just to see how it worked. Looks like a lot
 of work.
 
  Thanks for any advice.
  Aaron
 
 

 Setting up cacti is easy, the second time, and third time :)
 As for cassandra-cacti-m6 (i am the author). Unfortunately, I have
 been fighting the jmx switcharo battle for about 3 years now
 hadoop/hbase/cassandra/hornetq/vserver

 In a nutshell there is ALWAYS work involved. First, as you
 noticed, attributes get changed/removed/added/renamed. Second it takes a human
 to logically group things together. For example, if you have two items
 cache hits and cache misses. You really do not want two separate
 graphs that will scale independently. You want one slick stack graph,
 with nice colors, and you want a CDEF to calculate the cache hit
 percentage by dividing one into the other and show that at the bottom.
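
[Editor's note: the hit-percentage CDEF Edward describes reduces to a guarded division — a sketch of the arithmetic only; the actual CDEF is written in Cacti's RPN syntax:]

```python
# Cache hit percentage from two raw counters, guarding the zero case
# (the calculation a Cacti CDEF would perform in RPN form).
def cache_hit_pct(hits, misses):
    total = hits + misses
    return 0.0 if total == 0 else 100.0 * hits / total
```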

 If you want to have a 7.0 branch to cassandra-cacti-m6 I would love
 the help. We are not on 7.0 yet so I have not had the time just to go
 out and make graphs for a version we are not using yet :) but if you
 come up with patches they are happily accepted.

 Edward



Re: Cassandra HAProxy

2010-08-30 Thread Dave Viner
FWIW - we've been using HAProxy in front of a cassandra cluster in
production and haven't run into any problems yet.  It sounds like our
cluster is tiny in comparison to Anthony M's cluster.  But I just wanted to
mention that others out there are doing the same.

One thing in this thread that I thought was interesting is Ben's initial
comment that the presence of the proxy precludes clients properly backing off
from nodes returning errors.  I think it would be very cool if someone
implemented a mechanism for haproxy to detect the error nodes and then
enable it to drop those nodes from the rotation.  I'd be happy to help with
this, as I know how it works with haproxy and standard web servers or other
tcp servers.  But, I'm not sure how to make it work with Cassandra, since,
as Ben points out, it can return valid tcp responses (that say
error-condition) on the standard port.

Dave Viner
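
[Editor's note: for reference, a minimal HAProxy TCP frontend for Thrift on 9160 looks something like the fragment below. The hostnames, IPs, and check interval are made up; note that a plain TCP check only verifies the port answers — exactly the limitation Ben describes, since it cannot see application-level error responses:]

```
listen cassandra 0.0.0.0:9160
    mode tcp
    balance roundrobin
    # TCP-level check only: confirms the port is open, not that the
    # node is healthy at the Thrift/application level
    server cass1 10.0.0.1:9160 check inter 2000
    server cass2 10.0.0.2:9160 check inter 2000
    server cass3 10.0.0.3:9160 check inter 2000
```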


On Sun, Aug 29, 2010 at 4:48 PM, Anthony Molinaro 
antho...@alumni.caltech.edu wrote:


 On Sun, Aug 29, 2010 at 12:20:10PM -0700, Benjamin Black wrote:
  On Sun, Aug 29, 2010 at 11:04 AM, Anthony Molinaro
  antho...@alumni.caltech.edu wrote:
  
  
   I don't know it seems to tax our setup of 39 extra large ec2 nodes, its
   also closer to 24000 reqs/sec at peak since there are different tables
   (2 tables for each read and 2 for each write)
  
 
  Could you clarify what you mean here?  On the face of it, this
  performance seems really poor given the number and size of nodes.

 As you say I would expect to achieve much better performance given the node
 size, but if you go back and look through some of the issues we've seen
 over time, you'll find we've been hit with nodes being too small, having
 too few nodes to deal with request volume, having OOMs, having bad
 sstables,
 having the ring appear different to different nodes, and several other
 problems.

 Many of the i/o problems presented themselves as MessageDeserializer pool
 backups
 (although we stopped having these since Jonathan was by and suggested row
 cache of about 1Gb, thanks Riptano!).  We currently have mystery OOMs
 which are probably caused by GC storms during compactions (although usually
 the nodes restart and compact fine, so who knows).  I also regularly watch
 nodes go away for 30 seconds or so (logs show node goes dead, then comes
 back to life a few seconds later).

 I've sort of given up worrying about these, as we are in the process of
 moving this cluster to our own machines in a colo, so I figure I should
 wait until they are moved, and see how the new machines do before I worry
 more about performance.

 -Anthony

 --
 
 Anthony Molinaro   antho...@alumni.caltech.edu



Re: Thrift + PHP: help!

2010-08-19 Thread Dave Viner
I am a user of the perl api - so I'd like to lurk in case there are things
that can benefit both perl & php.

Dave Viner


On Wed, Aug 18, 2010 at 1:35 PM, Gabriel Sosa sosagabr...@gmail.com wrote:

 I would like to help with this too!


 On Wed, Aug 18, 2010 at 5:15 PM, Bas Kok bakot...@gmail.com wrote:

 I have some experience in this area and would be happy to help out as
 well.
 --
 Bas


 On Wed, Aug 18, 2010 at 8:26 PM, Dave Gardner 
 dave.gard...@imagini.netwrote:

 I'm happy to assist. Having a robust PHP implementation would help us
 greatly.

 Dave

 On Wednesday, August 18, 2010, Jeremy Hanna jeremy.hanna1...@gmail.com
 wrote:
  As Jonathan mentioned in his keynote at the Cassandra Summit, the
 thrift + php has some bugs and is maintainerless right now.
 
  Is there anyone out there in the Cassandra community that is adept at
 PHP that could help out with the thrift + php work?  It would benefit all
 who use Cassandra with PHP.
 
  Bryan Duxbury, a thrift developer/committer, said if someone really
 wanted to have a go at making thrift php robust, i would assist them
 heavily.
 
  Please respond to this thread or ask Bryan in the channel.





 --
 Gabriel Sosa
 Si buscas resultados distintos, no hagas siempre lo mismo. - Einstein



Re: error using get_range_slice with random partitioner

2010-08-06 Thread Dave Viner
Funny you should ask... I just went through the same exercise.

You must use Cassandra 0.6.4.  Otherwise you will get duplicate keys.
 However, here is a snippet of perl that you can use.

our $WANTED_COLUMN_NAME = 'mycol';
get_key_to_one_column_map('myKeySpace', 'myColFamily', 'mySuperCol', QUORUM,
\%map);

sub get_key_to_one_column_map
{
    my ($keyspace, $column_family_name, $super_column_name,
        $consistency_level, $returned_keys) = @_;

    my ($socket, $transport, $protocol, $client, $result, $predicate,
        $column_parent, $keyrange);

    $column_parent = new Cassandra::ColumnParent();
    $column_parent->{'column_family'} = $column_family_name;
    $column_parent->{'super_column'}  = $super_column_name;

    $keyrange = new Cassandra::KeyRange({
        'start_key' => '', 'end_key' => '', 'count' => 10
    });

    $predicate = new Cassandra::SlicePredicate();
    $predicate->{'column_names'} = [$WANTED_COLUMN_NAME];

    eval
    {
        $socket    = new Thrift::Socket($CASSANDRA_HOST, $CASSANDRA_PORT);
        $transport = new Thrift::BufferedTransport($socket, 1024, 1024);
        $protocol  = new Thrift::BinaryProtocol($transport);
        $client    = new Cassandra::CassandraClient($protocol);
        $transport->open();

        my ($next_start_key, $one_res, $iteration, $have_more, $value,
            $local_count, $previous_start_key);

        $iteration = 0;
        $have_more = 1;
        while ($have_more == 1)
        {
            $iteration++;
            $result = undef;

            $result = $client->get_range_slices($keyspace, $column_parent,
                $predicate, $keyrange, $consistency_level);

            # on success, $result is an array of objects.

            if (scalar(@$result) == 1)
            {
                # we only got 1 result... check to see if it's the
                # same key as the start key... if so, we're done.
                if ($result->[0]->{'key'} eq $keyrange->{'start_key'})
                {
                    $have_more = 0;
                    last;
                }
            }

            # check to see if we are starting with some value;
            # if so, we throw away the first result.
            if ($keyrange->{'start_key'})
            {
                shift(@$result);
            }
            if (scalar(@$result) == 0)
            {
                $have_more = 0;
                last;
            }

            $previous_start_key = $keyrange->{'start_key'};
            $local_count = 0;

            for (my $r = 0; $r < scalar(@$result); $r++)
            {
                $one_res        = $result->[$r];
                $next_start_key = $one_res->{'key'};

                $keyrange->{'start_key'} = $next_start_key;

                if (!exists($returned_keys->{$next_start_key}))
                {
                    $have_more = 1;
                    $local_count++;
                }

                next if (scalar(@{ $one_res->{'columns'} }) == 0);

                $value = undef;

                for (my $i = 0; $i < scalar(@{ $one_res->{'columns'} }); $i++)
                {
                    if ($one_res->{'columns'}->[$i]->{'column'}->{'name'} eq
                        $WANTED_COLUMN_NAME)
                    {
                        $value = $one_res->{'columns'}->[$i]->{'column'}->{'value'};
                        if (!exists($returned_keys->{$next_start_key}))
                        {
                            $returned_keys->{$next_start_key} = $value;
                        }
                        else
                        {
                            # NOTE: prior to Cassandra 0.6.4, get_range_slices
                            # returns duplicates sometimes.
                            # warn "Found second value for key [$next_start_key]: "
                            #    . "was [" . $returned_keys->{$next_start_key}
                            #    . "] now [$value]!";
                        }
                    }
                }
                $have_more = 1;
            } # end results loop

            if ($keyrange->{'start_key'} eq $previous_start_key)
            {
                $have_more = 0;
            }

        } # end while() loop

        $transport->close();
    };
    if ($@)
    {
        warn "Problem with Cassandra: " . Dumper($@);
    }

    # cleanup
    undef $client;
    undef $protocol;
    undef $transport;
    undef $socket;
}


HTH
Dave Viner

On Fri, Aug 6, 2010 at 7:45 AM, Adam Crain
adam.cr...@greenenergycorp.com wrote:

 Thomas,

 That was indeed the source of the problem. I naively assumed that the token
 range would help me avoid retrieving duplicate rows.

 If you iterate over the keys, how do you avoid retrieving duplicate keys? I
 tried this morning and I seem to get odd results. Maybe this is just a
 consequence of the random partitioner. I really don't care about the order
 of the iteration, but only each key once and that I see all keys is
 important.

 -Adam


 -Original Message-
 From: th.hel...@gmail.com on behalf of Thomas Heller
 Sent: Fri 8/6/2010 7:27 AM
 To: user@cassandra.apache.org
 Subject: Re: error using get_range_slice

Re: Please need help with Munin: Cassandra Munin plugin problem

2010-07-29 Thread Dave Viner
Is your code posted somewhere such that others could try it?

On Thu, Jul 29, 2010 at 5:57 AM, Miriam Allalouf
miriam.allal...@gmail.comwrote:

 Hi,
 Please, can someone  help us with Munin??
 Thanks,
 Miriam


 On Mon, Jul 26, 2010 at 1:58 PM, osishkin osishkin osish...@gmail.com
 wrote:
  Hi,
 
  I'm trying to use Munin to monitor cassandra.
  I've seen other people using munin here ,so I hope someone ran into
  this problem.
  The default plugins are working, so this is definitely a problem with
  the cassandra plugin.
 
  I keep getting errors such as :
  Exception in thread main java.lang.NoClassDefFoundError:
  javax.management.remote.JMXConnector
at org.munin.JMXQuery.disconnect(Unknown Source)
at org.munin.JMXQuery.main(Unknown Source)
  Plugin compactions_bytes exited with status 256.
  Exception in thread main java.lang.NoClassDefFoundError:
  javax.management.ObjectName
at org.munin.Configuration$FieldProperties.set(Unknown Source)
at org.munin.Configuration.parseString(Unknown Source)
at org.munin.Configuration.parse(Unknown Source)
at org.munin.JMXQuery.main(Unknown Source)
  Plugin jvm_memory exited with status 256.
 
  However when I call the plugin directly from my console (from
  /etc/munin/plugins) it works. So there must be something very basic
  I'm missing here.
  I'm using RHEL 5 with IBM jre 1.6.
 
  Anyone encountered a similar problem?
  I apologize for writing on an issue that's not purely cassandra here.
  Thank you
 



iterating over all rows keys gets duplicate key returns

2010-07-28 Thread Dave Viner
Hi all,

I'm having a strange result in trying to iterate over all row keys for a
particular column family.  The iteration works, but I see the same row key
returned multiple times during the iteration.

I'm using cassandra 0.6.3, and I've put the code in use at
http://pastebin.com/zz5xJQ8f

Using get_range_slices() and a keyrange with incrementing start_key's,
shouldn't I get an enumeration of the keys such that each key appears only
once ?


In iterating 1000 times, I was given the same rows 8322 times.  Somehow it
seems like something is amiss in how I'm performing the iteration over the
keys.  Any suggestions on how I can properly iterate?

Thanks
Dave Viner


Re: iterating over all rows keys gets duplicate key returns

2010-07-28 Thread Dave Viner
Just as a followup, here's what seems to be the resolution:

1. 0.6.4 should fix this problem.
2. Using OPP as the DHT should solve it as well.
3. Prior to 0.6.4, when using RandomPartitioner as the DHT, there's no good
way to guarantee that you see *all* row keys for a column family.

Strategies tried:

A. iterate over the keys returned until the start_key is identical to the
last key returned.  When start_key == last key returned, exit.
- fails since duplicate keys can appear anywhere, even as the last key
returned.

B. iterate over keys returned, adding the keys to a hash table.  When an
iteration returns no new keys, assume that all keys have been seen and exit.
- this also fails, since a particular result set can be full of duplicates,
but the iteration has not traversed the entire row-key spectrum.
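
[Editor's note: the paging pattern behind strategy B — advance start_key to the last key of each page, drop the echoed first row, stop when a page yields nothing new — can be sketched independently of any Thrift client. Here fetch_page is a hypothetical stand-in for get_range_slices; on 0.6.4+, where the duplicate bug is fixed, this terminates correctly, while the thread above explains why it could stop early on 0.6.3:]

```python
def iterate_all_keys(fetch_page, page_size=100):
    """Yield each row key once, paging with start_key = last key seen.

    fetch_page(start_key, count) stands in for get_range_slices and must
    return keys in partitioner order starting at start_key (inclusive).
    """
    seen = set()
    start_key = ""
    while True:
        page = fetch_page(start_key, page_size)
        if start_key and page and page[0] == start_key:
            page = page[1:]          # drop the echoed start key
        new_keys = [k for k in page if k not in seen]
        if not new_keys:
            return                   # a page with nothing new => done
        for k in new_keys:
            seen.add(k)
            yield k
        start_key = page[-1]
```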

Dave Viner

On Wed, Jul 28, 2010 at 3:48 PM, Rob Coli rc...@digg.com wrote:

 On 7/28/10 2:43 PM, Dave Viner wrote:

 Hi all,

 I'm having a strange result in trying to iterate over all row keys for a
 particular column family.  The iteration works, but I see the same row
 key returned multiple times during the iteration.

 I'm using cassandra 0.6.3, and I've put the code in use at


 For those not playing along on IRC, this was determined to be caused by :

 http://issues.apache.org/jira/browse/CASSANDRA-1042

 Which is fixed in 0.6.4.

 =Rob



Re: Quick Poll: Server names

2010-07-27 Thread Dave Viner
I've seen & used several...

names of children of employees of the company
names of streets near office
names of diseases (leads to very hard to spell names after a while, but was
quite educational for most developers)
names of characters from famous books (e.g., lord of the rings, asimov
novels, etc)


On Tue, Jul 27, 2010 at 7:54 AM, uncle mantis uncleman...@gmail.com wrote:

 I will be naming my servers after insect family names. What do you all use
 for yours?

 If this is something that is too off topic please contact a moderator.

 Regards,

 Michael



Re: non blocking Cassandra with Tornado

2010-07-27 Thread Dave Viner
FWIW - I think this is actually more of a question about Thrift than about
Cassandra.  If I understand you correctly, you're looking for a async
client.  Cassandra lives on the other side of the thrift service.  So, you
need a client that can speak Thrift asynchronously.

You might check out the new async Thrift client in Java for inspiration:

http://blog.rapleaf.com/dev/2010/06/23/fully-async-thrift-client-in-java/

Or, even better, port the Thrift async client to work for python and other
languages.

Dave Viner


On Tue, Jul 27, 2010 at 8:44 AM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  The idea is rather than calling a cassandra client function like
  get_slice(), call the send_get_slice() then have a non blocking wait on
 the
  socket thrift is using, then call recv_get_slice().

 (disclaimer: I've never used tornado)

 Without looking at the generated thrift code, this sounds dangerous.
 What happens if send_get_slice() blocks? What happens if
 recv_get_slice() has to block because you didn't happen to receive the
 response in one packet?

 Normally you're either doing blocking code or callback oriented
 reactive code. It sounds like you're trying to use blocking calls in a
 non-blocking context under the assumption that readable data on the
 socket means the entire response is readable, and that the socket
 being writable means that the entire request can be written without
  blocking. This might seem to work and you may not block, or block
 only briefly. Until, for example, a TCP connection stalls and your
 entire event loop hangs due to a blocking read.

 Apologies if I'm misunderstanding what you're trying to do.

 --
 / Peter Schuller
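
[Editor's note: Peter's warning — that "socket readable" does not mean "whole response readable" — holds for any framed protocol like Thrift's. A toy length-prefixed reader, no Thrift or Tornado involved, showing why the read must loop until the full frame arrives:]

```python
import socket
import struct

def read_frame(sock):
    """Read one 4-byte-length-prefixed frame, looping until complete.

    A single recv() may return only part of the frame, which is why
    treating one 'readable' event as 'response ready' is unsafe.
    """
    def read_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise EOFError("peer closed mid-frame")
            buf += chunk
        return buf

    (length,) = struct.unpack("!I", read_exact(4))
    return read_exact(length)
```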



Re: SV: How to stop cassandra server, installed from debian/ubuntu package

2010-07-26 Thread Dave Viner
Yes... if you're using debian cassandra you can do:

/etc/init.d/cassandra stop


On Mon, Jul 26, 2010 at 8:05 AM, Lee Parker l...@socialagency.com wrote:

 Which debian/ubuntu packages are you using?  I am using the ones that are
 maintained by Eric Evans and the init.d script stops the server correctly.

 Lee Parker
 On Mon, Jul 26, 2010 at 9:22 AM, miche...@hermanus.cc wrote:

 This is how I have been doing it:
 pkill cassandra

 then I do a netstat -anp | grep 8080
 I look for the java service id running and then kill that java id,
 e.g. kill <java id>
 --Original Message--
 From: Thorvaldsson Justus
 To: 'user@cassandra.apache.org'
 ReplyTo: user@cassandra.apache.org
 Subject: SV: How to stop cassandra server, installed from
 debian/ubuntu package
 Sent: Jul 26, 2010 4:14 PM

 I use standard close, CTRL C, I don't run it as a daemon
 Dunno but think it works fine =)

 -----Original Message-----
 From: o...@notrly.com [mailto:o...@notrly.com]
 Sent: 26 July 2010 15:52
 To: user@cassandra.apache.org
 Subject: How to stop cassandra server, installed from debian/ubuntu package

 Hi, this might be a dumb question, but I was wondering how do i stop the
 cassandra server.. I installed it using the debian package, so i start
 cassandra by running /etc/init.d/cassandra. I looked at the script and
 tried /etc/init.d/cassandra stop, but it looks like it just tries to start
 cassandra again, so i get the port in use exception.

 Thanks



 Sent via my BlackBerry from Vodacom - let your email find you!





Re: Design questions/Schema help

2010-07-26 Thread Dave Viner
AFAIK, atomic increments are not available.  There recently has been quite a
bit of discussion about them.  So, you might search the archives.


Dave Viner

On Mon, Jul 26, 2010 at 7:02 PM, Mark static.void@gmail.com wrote:

  On 7/26/10 6:06 PM, Dave Viner wrote:

 I'd love to hear other's opinions here... but here are my 2 cents.

  With Cassandra, you need to think of the queries - which you've pretty
 much done.

  For the most popular queries, you could do something like:

   <ColumnFamily Name="QueriesCounted"
  CompareWith="UTF8Type"
  />
 And then access it as:
 key-space.QueriesCounted['query-foo-bar'] = $count;

  This makes it easy to get the count for any particular query.  I'm not
 sure of the best way to store the top counts idea.  Perhaps a secondary
 process which iterates over all the queries, sorts the query
 values by count, and then stores them into another ColumnFamily.

  You could use the same idea for the last query (session ids by query)

   <ColumnFamily Name="QueriesRecorded"
  CompareWith="UTF8Type"
  ColumnType="Super"
  CompareSubcolumnsWith="TimeUUIDType"
  />
 And then access it as:
 key-space.QueriesRecorded['query-foo-bar'][timeuuid] = session-id;

  Actually, if you used that idea (queries-recorded), you could generate
 the counts and aggregates from that directly in a hadoop post-processing...

  But perhaps others will have better ideas.  If you haven't read
 http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model, go read it
 now.  It won't answer your question directly, but will describe the process
 of modeling a blog in cassandra so you can get a sense of the process.

  Dave Viner




 On Mon, Jul 26, 2010 at 4:46 PM, Mark static.void@gmail.com wrote:

  We are thinking about using Cassandra to store our search logs. Can
 someone point me in the right direction/lend some guidance on design? I am
 new to Cassandra and I am having trouble wrapping my head around some of
 these new concepts. My brain keeps wanting to go back to a RDBMS design.

 We will be storing the user query, # of hits returned and their session
 id. We would like to be able to answer the following questions.

 - What is the n most popular queries and their counts within the last x
 (mins/hours/days/etc). Basically the most popular searches within a given
 time range.
 - What is the most popular query within the last x where hits = 0. Same as
 above but with an extra where clause
 - For session id x give me all their other queries
 - What are all the session ids that searched for 'foos'

 We accomplish the above functionality w/ MySQL using 2 tables. One for the
 raw search log information and the other to keep the aggregate/running
 counts of queries.

 Would this sort of ad-hoc querying be better implemented using Hadoop +
 Hive? If so, should I be storing all this information in Cassandra then
 using Hadoop to retrieve it?

 Thanks for your suggestions


   Perhaps a secondary process which iterates over all the queries, sorts
  the query values by count, and then stores them into another
  ColumnFamily.

 - I was trying to avoid this. Is there some sort of atomic increment
 feature available? I guess I could do the same thing we are currently doing
 which is...

 a) store full query details into table A
 b) query table B for aggregate count of query 'foo' then store count + 1
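
[Editor's note: Dave's two-column-family layout can be mocked up with nested Python dicts to sanity-check the access patterns before touching Cassandra. Dicts stand in for the column families and plain integers for TimeUUIDs — none of this is a client API:]

```python
from collections import defaultdict

# QueriesCounted[query] -> count; QueriesRecorded[query][time] -> session id
queries_counted = defaultdict(int)
queries_recorded = defaultdict(dict)

def record_search(query, timestamp, session_id):
    # In Cassandra 0.6 this counter would need external aggregation,
    # since atomic increments are not available (the point made above).
    queries_counted[query] += 1
    queries_recorded[query][timestamp] = session_id

def sessions_for(query):
    # "what are all the session ids that searched for 'foos'"
    return sorted(set(queries_recorded[query].values()))

def top_queries(n):
    # "n most popular queries and their counts"
    return sorted(queries_counted.items(), key=lambda kv: -kv[1])[:n]

record_search("foos", 1, "s1")
record_search("foos", 2, "s2")
record_search("bars", 3, "s1")
```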



Re: Cassandra Chef recipe and EC2 snitch

2010-07-22 Thread Dave Viner
You don't need the ec2snitch necessarily.  AFAIK, it's meant to be a
better way of detecting where your ec2 instances are.  But, unless you're
popping instances all the time, I don't think it's worth it.

Check out the step-by-step guide on that same page.  Pure EC2 api calls to
setup your cluster.

You can also use rackaware-ness in EC2.  Just add in the PropertyFile
endpoint and put your rack file in /etc/cassandra/rack.properties.

Dave Viner

On Thu, Jul 22, 2010 at 10:08 AM, Allan Carroll alla...@gmail.com wrote:

 Hi all,

 I'm setting up a new cluster on EC2 for the first time and looking at the
 wiki cloud setup page (http://wiki.apache.org/cassandra/CloudConfig).
 There's a chef recipe linked there that mentions an ec2snitch. The link
 doesn't seem to go where it says it does. Does anyone know where those
 resources have gone or are they no longer available?

 Thanks
 -Allan


Re: Suggestion for the storage.conf

2010-07-19 Thread Dave Viner
Added: http://wiki.apache.org/cassandra/StorageConfiguration


On Mon, Jul 19, 2010 at 2:55 AM, Dimitry Lvovsky dimi...@reviewpro.com wrote:

  I think it would be a good idea to add a bit more explanation to the
  storage-conf.xml/wiki regarding the replication factor.  It caused some
  confusion until we dug around the mail archive to realize that our
  UnavailableExceptions were caused by our incorrect assumption: RF=1
  does NOT mean that this node's data will be replicated to one other node --
  rather, the data will only exist on one node :-s.


 --
 Dimitry Lvovsky
 ReviewPro
 www.reviewpro.com




Re: Cassandra benchmarking on Rackspace Cloud

2010-07-19 Thread Dave Viner
This may be too much work... but you might consider building an Amazon EC2
AMI of your nodes.  This would let others quickly boot up your nodes and run
the stress test against it.

I know you mentioned that you're using Rackspace Cloud.  I'm not super
familiar with the internals of RSCloud, but perhaps they have something
similar?

This feels like the kind of problem that might be easier for someone else to
setup and quickly test.  (The beauty of the virtual server - quick setup and
quick tear down)

Dave Viner


On Mon, Jul 19, 2010 at 10:24 AM, Peter Schuller 
peter.schul...@infidyne.com wrote:

  I ran this test previously on the cloud, with similar results:
 
  nodes   reads/sec
  1   24,000
  2   21,000
  3   21,000
  4   21,000
  5   21,000
  6   21,000
 
  In fact, I ran it twice out of disbelief (on different nodes the second
  time) with essentially identical results.

 Something other than cassandra just *has* to be fishy here unless
 there is some kind of bug causing communication with nodes that should
 not be involved. It really sounds like there is a hidden bottleneck
 somewhere.

 You already mention that you've run multiple test clients so that the
 client is not a bottleneck. What about bandwidth? I could imagine
 bandwidth adding up a bit given those requests rate. Is it possible
 all the nodes are communicating with each other via some bottleneck
 (like 100 mbit)?

 What does the load look like when you observe the nodes during
 bottlenecking? How much bandwidth is each machine pushing (ifstat,
 nload, etc); is Cassandra obviously CPU bound or does it look idle?

 Presumably Cassandra is not perfectly concurrent and you may not
 saturate 8 cores under this load necessarily, but as you add more and
 more nodes and still only reaching 21k/sec you should come past a
 point where you're not even saturating a single core...

 *Something* else is probably going on.

 --
 / Peter Schuller



Re: A very short summary on Cassandra for a book

2010-07-15 Thread Dave Viner
I am no expert... but parts seem accurate, parts not.

"Cassandra stores four or five dimension associated arrays"
not sure what you're counting as a dimension of the associated array, but
here are the 2 associative array-like syntaxes:

ColumnFamily[row-key][column-name] = value1
ColumnFamily[row-key][super-column-name][column-name] = value2


"The first dimension is fixed on creation of the database but the rest can
be infinitely large"
I don't understand this sentence.  The definition of a ColumnFamily is set
by the configuration file (storage-conf.xml).  If you change it, and restart
a node, that node will use the new definition of the CF.

It is true that the number of columns can be large.  I have no idea if it's
actually infinite - but more or less.

Also, it's probably not precise to call it a database, since that tends to
invoke images of things like MySQL, Oracle, Postgres, etc.


"Inserts are super fast and can happen to any
database server in the cluster."
Yes, this is true.

"However, the system is append only there so there is no in-place update
operation like increment"
The first part is not quite true.  There is appending, but there is no
increment that's guaranteed universal.  Cassandra is eventually
consistent.  So atomic increment doesn't really work in the eventual
world.  But, more precisely, one can add, update, change, modify, delete
rows, columns, and values at any time from any node.

"Also sorting happens on insert time"
Yes, I believe this is true.

Dave Viner


On Thu, Jul 15, 2010 at 4:26 PM, Karoly Negyesi chx1...@gmail.com wrote:

 Hi,

 I am writing a scalability chapter in a book and I need to mention
 Apache Cassandra although it's just a mention. Still I would not like
 to be sloppy and would like to get verification whether my summary is
 accurate. Cassandra stores four or five dimension associated arrays.
 The first dimension is fixed on creation of the database but the rest
 can be infinitely large. Inserts are super fast and can happen to any
 database server in the cluster. However, the system is append only
 there so there is no in-place update operation like increment. Also
 sorting happens on insert time.

 Thanks

 Karoly Negyesi



Re: Elastic Load Balancing Cassandra

2010-07-13 Thread Dave Viner
I haven't used ELB, but I've setup HAProxy to do it... appears to work well
so far.

Dave Viner


On Tue, Jul 13, 2010 at 3:30 PM, Brian Helfrich helfrich9...@gmail.com wrote:

 Hi, has anyone been able to load balance a Cassandra cluster with an AWS
 Elastic Load Balancer? I've setup an ELB with the obvious settings (namely,
  --listener lb-port=9160,instance-port=9160,protocol=TCP) but clients
 simply hang trying to load records from the ELB hostname:9160.

 Thanks,
 --Brian.



Re: Is anyone using version 0.7 schema update API

2010-07-13 Thread Dave Viner
Check out step 4 of this page:
https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP

./compiler/cpp/thrift -gen php
../PATH-TO-CASSANDRA/interface/cassandra.thrift

That is how to compile the thrift client from the cassandra bindings. Just
replace the php with the language of your choosing. According to
http://wiki.apache.org/thrift/, Thrift has generators for C++, C#, Erlang,
Haskell, Java, Objective C/Cocoa, OCaml, Perl, PHP, Python, Ruby, and
Squeak

HTH
Dave Viner

On Tue, Jul 13, 2010 at 6:05 PM, GH gavan.h...@gmail.com wrote:


 To be honest I do not know how to regenerate the bindings; I will look
 into that.
 Following your email, I went ahead and took the unit test code and created
 a client. Given that this code works, I am guessing that the thrift
 bindings are in place and it is more that the client code does not support
 the new functions yet.
 I might be off track, and I don't know if it is appropriate for someone as
 new to this as I am to make changes to the client and submit them
 (especially if someone else is already doing that). I could do that, if it
 helped the group.





 On Wed, Jul 14, 2010 at 2:12 AM, Benjamin Black b...@b3k.us wrote:

 I updated the Ruby client to 0.7, but I am not a Cassandra committer
 (and not much of a Java guy), so haven't touched the Java client.  Is
 there more to it than regenerating Thrift bindings?

 On Tue, Jul 13, 2010 at 1:42 AM, GH gavan.h...@gmail.com wrote:
  They are not complicated; it's more that they are not in the package
  they should be in.
  I assume the client package exposes the functionality of the server, and
  it does not have the ability to manage the tables in the database, which
  to me seems to be extremely limiting.
  When I did not see that code in place I assumed that it was not complete
  or that I had not got the right code drop.
  From your comments it sounds like you don't keep the Java client code
  base in line with the Ruby code, which I think is limiting but is just
  the way it is.
 
  On Tue, Jul 13, 2010 at 8:53 AM, Benjamin Black b...@b3k.us wrote:
 
  I guess I don't understand what is so complicated about the schema
  management calls that numerous examples are needed.
 
  On Mon, Jul 12, 2010 at 4:43 AM, GH gavan.h...@gmail.com wrote:
   Hi,
   My problem is that I cannot locate Java equivalents to the API calls in
   the Ruby files you have presented. They are not visible in the Java
   client packages I have (my code is not that old, off trunk).

   I located the code below in some of the unit test code files. This code
   will have to be refactored to create a test. This is all I could find,
   and it seems that there must be better client examples than this.

   I expected to see client code in the org.apache.cassandra.cli package,
   but there was nothing there. (I searched all of the code for calls to
   these APIs in the end.)
   Where should I be looking to get proper Java code samples?
   Regards
   Gavan
  
  
   Here is what I was about to refactor...
  
    TSocket socket = new TSocket(
            DatabaseDescriptor.getListenAddress().getHostName(),
            DatabaseDescriptor.getRpcPort());
    TTransport transport = socket;
    TBinaryProtocol binaryProtocol = new TBinaryProtocol(transport, false, false);
    Cassandra.Client cassandraClient = new Cassandra.Client(binaryProtocol);
    transport.open();
    thriftClient = cassandraClient;

    Set<String> keyspaces = thriftClient.describe_keyspaces();
    if (!keyspaces.contains(KEYSPACE))
    {
        List<CfDef> cfDefs = new ArrayList<CfDef>();
        thriftClient.system_add_keyspace(new KsDef(KEYSPACE,
                "org.apache.cassandra.locator.RackUnawareStrategy", 1, cfDefs));
    }
    thriftClient.set_keyspace(KEYSPACE);

    CfDef cfDef = new CfDef(KEYSPACE, COLUMN_FAMILY);
    try
    {
        thriftClient.system_add_column_family(cfDef);
    }
    catch (InvalidRequestException e)
    {
        throw new RuntimeException(e);
    }
  
  
  
  
  
  
   On Mon, Jul 12, 2010 at 4:34 PM, Benjamin Black b...@b3k.us wrote:
  
   http://github.com/fauna/cassandra/tree/master/lib/cassandra/0.7/
  
   Unclear to me what problems you are experiencing.
  
   On Sun, Jul 11, 2010 at 2:27 PM, GH gavan.h...@gmail.com wrote:
Hi Dop,

Do you have any code on dynamically creating a KeySpace and ColumnFamily?
Currently I was all but creating a new client to do that, which seems to
be the wrong way.
If you have something that works, that will put me on the right track,
I hope.

Gavan
   
   
On Mon, Jul 12, 2010 at 2:41 AM, Dop Sun su...@dopsun.com
 wrote:
   
Based on current source code at the head, moving from 0.6.x to 0.7
means some code changes on the client side (other than server-side
changes, like storage-conf.xml).

Something like:

1.   New Clock class instead

Re: How to add a new Keyspace?

2010-07-08 Thread Dave Viner
Here are my notes on how to make schema changes in 0.6:

# Empty the commitlog with nodetool drain.
= NOTE while this is running, the node will not accept writes.
# Shutdown Cassandra and verify that there is no remaining data in the
commitlog.
= HOW to verify?
# Delete the sstable files (-Data.db, -Index.db, and -Filter.db) for any CFs
removed, and rename the files for any CFs that were renamed.
# Make necessary changes to your storage-conf.xml.
# Start Cassandra back up and your edits should take effect.

Related URLs:
http://wiki.apache.org/cassandra/FAQ#modify_cf_config
http://www.mail-archive.com/user@cassandra.apache.org/msg02498.html

HTH
Dave Viner


On Wed, Jul 7, 2010 at 11:39 PM, Peter Schuller peter.schul...@infidyne.com
 wrote:

If I want to add a new Keyspace, does it mean I have to distribute my
  storage-conf.xml to all of the nodes? And restart all of the nodes?

 I *think* that is the case in Cassandra 0.6, but I'll let someone else
 comment. In trunk/the upcoming 0.7 there are live schema updates that
 propagate through the cluster:

   http://wiki.apache.org/cassandra/LiveSchemaUpdates


 --
 / Peter Schuller
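
For the sstable-deletion step in the notes above, a small helper can make the
file selection explicit before anything is removed. The sketch assumes the
0.6-era naming pattern <CF>-<generation>-{Data,Index,Filter}.db implied by
the notes; verify against your own data directory before deleting anything:

```python
def sstables_for_dropped_cfs(filenames, dropped_cfs):
    """Select the sstable files belonging to dropped column families,
    assuming the 0.6-era <CF>-<generation>-{Data,Index,Filter}.db naming."""
    suffixes = ("-Data.db", "-Index.db", "-Filter.db")
    doomed = []
    for name in filenames:
        for cf in dropped_cfs:
            if name.startswith(cf + "-") and name.endswith(suffixes):
                doomed.append(name)
    return sorted(doomed)

listing = ["Users-1-Data.db", "Users-1-Index.db", "Users-1-Filter.db",
           "Posts-3-Data.db"]
print(sstables_for_dropped_cfs(listing, ["Users"]))
# → ['Users-1-Data.db', 'Users-1-Filter.db', 'Users-1-Index.db']
```

Printing the doomed list first (rather than deleting in the same pass) fits
the cautious tone of the notes: the node is down at this point and a mistaken
delete is not recoverable from the commitlog.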



Re: Property file snitch for Cassandra?

2010-07-07 Thread Dave Viner
After more investigation, as well as a bunch of trial and error, here's what
seems to be happening.

1. The rack.properties file key values (the stuff before the =) must match
the toString() method of the InetAddress object for the host.
2. (In EC2) the InetAddress of a node *other* than the one you are on will
have no hostname, but will have an IP address.
3. (In EC2) the InetAddress of the node on which the rack.properties is read
will be the internal name + '/' + IP address.

So, the solution that *appears* to work for EC2 (without touching DNS) is to
have 2 lines in the rack.properties file for each host:


/10.196.35.64=DC1:RAC1
ip-10-196-35-64.ec2.internal/10.196.35.64=DC1:RAC1

The first line is used by the *other* nodes in the cluster to identify this
server as belonging to DC1 and RAC1.  The second line is used by *this* node
to identify itself as in DC1 and RAC1.

I've not yet proven to myself that this is accurate, but it definitely stops
the error messages and, from looking at the code, it seems like it should
work.

Is this correct?

Thanks
Dave Viner


On Wed, Jul 7, 2010 at 6:22 PM, Eric Evans eev...@rackspace.com wrote:


 Let's move this to the user@ list...

 On Wed, 2010-07-07 at 16:32 -0700, Dave Viner wrote:

  After starting up my cluster, I see this one of the system.log :
 
 
  ERROR [GMFD:1] 2010-07-07 23:27:46,044 PropertyFileEndPointSnitch.java
  (line 91) Could not find end point information for /10.202.159.32,
  will use default.
 
 
  However, I definitely have that IP listed in
  the /etc/cassandra/rack.properties file on that machine:
 
 
  # grep 10.202.159.32 /etc/cassandra/rack.properties
  10.202.159.32\:7000=DC1:RAC1
  #
 
 
  The \:7000 is from the sample file and description
  at
 http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.3/contrib/property_snitch/README.txt
 .
 
 
  Is there some other format I should be using?


 --
 Eric Evans
 eev...@rackspace.com
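
If the two-line workaround above is correct, generating both forms for every
host can be scripted. A hedged Python sketch: deriving the
ip-a-b-c-d.ec2.internal hostname from the IP is an assumption based on the
example above, and real internal hostnames should be taken from the
instances themselves.

```python
def rack_properties_lines(hosts):
    """Emit both key forms for each host: the bare /ip form (used when
    *other* nodes look this host up) and the hostname/ip form (used by
    the host itself).  Deriving ip-a-b-c-d.ec2.internal from the IP is
    an assumption; take the real internal hostname from the instance."""
    lines = []
    for ip, dc_rack in hosts.items():
        hostname = "ip-" + ip.replace(".", "-") + ".ec2.internal"
        lines.append(f"/{ip}={dc_rack}")
        lines.append(f"{hostname}/{ip}={dc_rack}")
    return lines

print("\n".join(rack_properties_lines({"10.196.35.64": "DC1:RAC1"})))
# → /10.196.35.64=DC1:RAC1
# → ip-10-196-35-64.ec2.internal/10.196.35.64=DC1:RAC1
```

Generating the file from one host map also keeps every node's copy of
rack.properties identical, which avoids the per-node drift that makes this
kind of misconfiguration hard to spot.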




Backing up the data stored in cassandra

2010-07-07 Thread Dave Viner
Hi all,

What is the recommended strategy for backing up the data stored inside
cassandra?

I realize that Cassandra is a distributed database, and with a decent
replication factor the data is already replicated in some sense.  But, as a
relatively new user, I'm always concerned that the data lives only within
the system and is not stored *anywhere* else.

In an earlier email in the list, the recommendation was:

Until tickets 193 and 520 are done, the easiest thing is to copy all
the sstables from the other nodes that have replicas for the ranges it
is responsible for (e.g. for replication factor of 3 on rack unaware
partitioner, the nodes before it and the node after it on the right
would suffice), and then run nodeprobe cleanup to clear out the
excess.

Is this still the recommended approach?  If I backed up the files in
DataDirectories/*, would it be possible to restore a node using those files?
(That is, bring up a new node, copy the backed-up files from the crashed
node onto the new node, then have the new node join the cluster?)


Thanks

Dave Viner
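
The quoted recommendation can be modeled in a few lines: with the
rack-unaware strategy at replication factor 3, each node stores its own
range plus those of its two predecessors, so a failed node's predecessor and
successor together hold copies of everything it stored. A small Python
sketch of that reasoning; the function names and the toy token ring are
illustrative, not part of Cassandra.

```python
def ranges_stored(node, ring, rf=3):
    """Ranges (named by their primary owner) that `node` stores: its own
    range plus those of its rf-1 predecessors on the ring, as with the
    rack-unaware placement strategy."""
    i, n = ring.index(node), len(ring)
    return {ring[(i - k) % n] for k in range(rf)}

def restore_sources(node, ring):
    """Neighbors whose sstables, taken together, contain every range the
    failed node held: its predecessor and its successor on the ring."""
    i, n = ring.index(node), len(ring)
    return [ring[(i - 1) % n], ring[(i + 1) % n]]

ring = ["A", "B", "C", "D", "E"]          # a toy five-node token ring
pred, succ = restore_sources("C", ring)   # "B" and "D"
covered = ranges_stored(pred, ring) | ranges_stored(succ, ring)
assert ranges_stored("C", ring) <= covered  # neighbors cover everything C held
```

The copy from the neighbors brings over more ranges than the restored node is
responsible for, which is why the quoted advice ends with running nodeprobe
cleanup to clear out the excess.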