Re: new node gets no data

2012-03-16 Thread aaron morton
ahh, I think you may have hit a corner case here. 

Is the RF still 1?

 INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727
 AntiEntropyService.java (line 663) [repair #%s] No neighbors to repair
 with on range %s: session completed
Means there are no nodes which share the range with this node. So there is 
nothing to repair. 

To put it another way: as far as 161.101 is concerned none of the keys it is 
responsible for are stored on another node. So there are no other nodes that 
could be involved in a repair session. 

It looks like some data may have been written to 161.101 so I think the safest 
approach would be:
* increase the RF to 2
* repair
* decrease the RF to 1

When you added the node, was auto_bootstrap enabled? I would have thought that 
would stream data from the first node to the new one. 
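
A rough sketch of that bump-and-repair cycle (a minimal sketch, not a recipe: cassandra-cli statements use the 1.0-style strategy_options syntax, the keyspace name comes from the log above and the addresses from the nodetool ring output quoted below):

  -- in cassandra-cli, on either node:
  update keyspace rslog_production with strategy_options = {replication_factor:2};

  # from the shell, repair the new node:
  nodetool -h 10.80.161.101 repair rslog_production

  -- once repair completes, set replication_factor back to 1 the same way, then
  # drop data the nodes no longer own:
  nodetool -h 10.102.37.168 cleanup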

Cheers

  
-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 7:22 PM, Thorsten von Eicken wrote:

 Thanks for the suggestion Aaron, unfortunately, that seems to do
 absolutely nothing:
 
 # nodetool -h localhost repair
  INFO [RMI TCP Connection(160)-127.0.0.1] 2012-03-16 06:15:13,718
 StorageService.java (line 1770) Starting repair command #1, repairing 1
 ranges.
 INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727
 AntiEntropyService.java (line 658) [repair
 #6472b290-6f2f-11e1--472739b10cff] new session: will sync
 /10.80.161.101 on range (0,85070591730234615865843651857942052864] for
 rslog_production.[users, req_text, req_attr_idx, req_word_idx,
 req_word_freq, sessions, requests, info]
 INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727
 AntiEntropyService.java (line 663) [repair #%s] No neighbors to repair
 with on range %s: session completed
 INFO [RMI TCP Connection(160)-127.0.0.1] 2012-03-16 06:15:13,727
 StorageService.java (line 1807) Repair command #1 completed successfully
 
 Stumped...
TvE
 
 
 On 3/15/2012 6:41 PM, aaron morton wrote:
try running nodetool repair on 10.80.161.101 and then cleanup
 on 10.102.37.168 if everything is ok. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 16/03/2012, at 6:45 AM, Thorsten von Eicken wrote:
 
 I added a second node to a single-node ring. RF=1. I can't get the new
 node to receive any data. Logs look fine. Here's what nodetool reports:
 
 # nodetool -h localhost ring
 Address DC  RackStatus State   Load   
 OwnsToken
 
 85070591730234615865843651857942052864
 10.102.37.168   datacenter1 rack1   Up Normal  807.81 GB  
 50.00%  0
 10.80.161.101   datacenter1 rack1   Up Normal  1.15 MB
 50.00%  85070591730234615865843651857942052864
 
 Just a little imbalance. Yes, I use partitioner:
 org.apache.cassandra.dht.RandomPartitioner
 I tried moving the new node's token up/down by 1 and it triggers the log
 messages you'd expect, but no data gets transferred. How do I
 troubleshoot this? Below are the log messages I see when restarting the
 new node:
 
 INFO [main] 2012-03-15 17:31:08,616 AbstractCassandraDaemon.java (line
 120) JVM vendor/version:
 Java HotSpot(TM) 64-Bit Server VM/1.6.0_24
 INFO [main] 2012-03-15 17:31:14,812 CommitLog.java (line 178) Log
 replay complete, 8 replayed mutations
 INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 390)
 Cassandra version: 1.0.6
 INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 391)
 Thrift API version: 19.19.0
 INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 404)
 Loading persisted ring state
 INFO [main] 2012-03-15 17:31:14,834 StorageService.java (line 482)
 Starting up server gossip
 INFO [main] 2012-03-15 17:31:15,372 MessagingService.java (line 247)
 Starting Encrypted Messaging Service on SSL port 7000
 INFO [main] 2012-03-15 17:31:15,376 MessagingService.java (line 268)
 Starting Messaging Service on port 7001
 INFO [main] 2012-03-15 17:31:15,401 StorageService.java (line 579)
 Using saved token 85070591730234615865843651857942052864
 INFO [main] 2012-03-15 17:31:15,402 ColumnFamilyStore.java (line 692)
 Enqueuing flush of Memtable-LocationInfo@645492252(53/66 serialized/live
 bytes, 2 ops)
 INFO [FlushWriter:1] 2012-03-15 17:31:15,403 Memtable.java (line 240)
 Writing Memtable-LocationInfo@645492252(53/66 serialized/live bytes,
 2 ops)
 INFO [FlushWriter:1] 2012-03-15 17:31:15,421 Memtable.java (line 277)
 Completed flushing /mnt/ebs/data/system/LocationInfo-hc-32-Data.db (163
 bytes)
 INFO [main] 2012-03-15 17:31:15,424 StorageService.java (line 948) Node
 /10.80.161.101 state jump to normal
 INFO [main] 2012-03-15 17:31:15,434 StorageService.java (line 589)
 Bootstrap/Replace/Move completed! Now serving reads.
 
 # describe keyspace
 Keyspace: rslog_production:
 Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
 Durable Writes: true
   Options: [replication_factor:1]
 Column Families:
 
 



Re: Bootstrapping a new node to a running cluster

2012-03-16 Thread aaron morton
I think your original plan is sound. 

1. Up the RF to 4. 
2. Add the node with auto_bootstrap true
3. Once bootstrapping has finished the new node has all the data it needs. 
4. Check for secondary index creation using describe in the CLI to see which 
are built. You can also see progress using nodetool compactionstats
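
A minimal sketch of steps 1, 2 and 4 (assuming SimpleStrategy; the keyspace name and new-node host are placeholders, CLI syntax as used elsewhere in this thread):

  -- step 1, in cassandra-cli:
  update keyspace MyKeyspace with strategy_options = {replication_factor:4};

  # step 2, in cassandra.yaml on the new node before starting it:
  #   auto_bootstrap: true

  -- step 4, in cassandra-cli the "Built indexes" line shows which indexes are finished:
  use MyKeyspace;
  describe;

  # progress can also be watched from the shell:
  nodetool -h <new-node> compactionstats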

 I'm a bit puzzled though, I just tried to increase R to 3 in a cluster with 
 N=2. It serves reads and writes without issues at CL.one. Is the described 
 restriction something that will be implemented in the future?
I had a quick glance at the code. IIRC there was an explicit check if RF > N, 
but I cannot find it any more. I'm guessing we now rely on a normal 
UnavailableFailure if there are not enough UP nodes. 

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 8:56 PM, Mikael Wikblom wrote:

 ok, thank you both for the clarification. So the correct approach would be to 
 bootstrap the new node and run repair on each of the nodes in the cluster.
 
 I'm a bit puzzled though, I just tried to increase R to 3 in a cluster with 
 N=2. It serves reads and writes without issues at CL.one. Is the described 
 restriction something that will be implemented in the future?
 
 Thank you
 Regards
 
 
 
 
 On 03/16/2012 03:07 AM, aaron morton wrote:
 
 The documentation is correct. 
 I was mistakenly remembering discussions in the past about RF > #nodes. 
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 16/03/2012, at 4:34 AM, Doğan Çeçen wrote:
 
 I'm not sure why this is not allowed. As long as I do not use CL.all there
 will be enough nodes available to satisfy the read / write (at least when I
 look at ReadCallback and the WriteResponseHandler). Or am I missing
 something here?
 
 According to 
 http://www.datastax.com/docs/1.0/cluster_architecture/replication
 
 As a general rule, the replication factor should not exceed the
 number of nodes in the cluster. However, it is possible to increase
 replication factor, and then add the desired number of nodes
 afterwards. When replication factor exceeds the number of nodes,
 writes will be rejected, but reads will be served as long as the
 desired consistency level can be met.
 
 -- 
 ()  ascii ribbon campaign - against html e-mail
 /\  www.asciiribbon.org   - against proprietary attachments
 
 
 
 -- 
 Mikael Wikblom
 Software Architect
 SiteVision AB
 019-217058
 mikael.wikb...@sitevision.se
 http://www.sitevision.se



Re: Bootstrapping a new node to a running cluster

2012-03-16 Thread Mikael Wikblom

ok, thank you for your time!

Cheers


On 03/16/2012 10:12 AM, aaron morton wrote:

I think your original plan is sound.

1. Up the RF to 4.
2. Add the node with auto_bootstrap true
3. Once bootstrapping has finished the new node has all the data it needs.
4. Check for secondary index creation using describe in the CLI to see 
which are built. You can also see progress using nodetool compactionstats


I'm a bit puzzled though, I just tried to increase R to 3 in a 
cluster with N=2. It serves reads and writes without issues at CL.one. 
Is the described restriction something that will be implemented in 
the future?
I had a quick glance at the code. IIRC there was an explicit check if 
RF > N, but I cannot find it any more. I'm guessing we now rely on a 
normal UnavailableFailure if there are not enough UP nodes.


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 8:56 PM, Mikael Wikblom wrote:

ok, thank you both for the clarification. So the correct approach 
would be to bootstrap the new node and run repair on each of the 
nodes in the cluster.


I'm a bit puzzled though, I just tried to increase R to 3 in a 
cluster with N=2. It serves reads and writes without issues at CL.one. 
Is the described restriction something that will be implemented in 
the future?


Thank you
Regards




On 03/16/2012 03:07 AM, aaron morton wrote:

The documentation is correct.
I was mistakenly remembering discussions in the past about RF > #nodes.

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 4:34 AM, Doğan Çeçen wrote:

I'm not sure why this is not allowed. As long as I do not use 
CL.all there
will be enough nodes available to satisfy the read / write (at 
least when I

look at ReadCallback and the WriteResponseHandler). Or am I missing
something here?


According to 
http://www.datastax.com/docs/1.0/cluster_architecture/replication


As a general rule, the replication factor should not exceed the
number of nodes in the cluster. However, it is possible to increase
replication factor, and then add the desired number of nodes
afterwards. When replication factor exceeds the number of nodes,
writes will be rejected, but reads will be served as long as the
desired consistency level can be met.

--
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments





--
Mikael Wikblom
Software Architect
SiteVision AB
019-217058
mikael.wikb...@sitevision.se
http://www.sitevision.se





--
Mikael Wikblom
Software Architect
SiteVision AB
019-217058
mikael.wikb...@sitevision.se
http://www.sitevision.se



Re: Datastax Enterprise mixed workload cluster configuration

2012-03-16 Thread Alexandru Sicoe
Hi,

Since this thread already contains the system setup, I just want to ask
another question:

Say you have 3 data centers (DC1, DC2 and DC3) and a keyspace whose strategy
options give each DC one replica. If you only write to the nodes in DC1, what
path do the replicas take, assuming you've correctly interleaved and evenly
spaced the tokens of all the nodes? If you write a record to a node in DC1,
will it replicate it to the node in DC2, and the node in DC2 then replicate it
to the node in DC3? Or will the node in DC1 replicate the record to both DC2
and DC3?

Cheers,
Alex

On Thu, Mar 15, 2012 at 11:26 PM, Alexandru Sicoe adsi...@gmail.com wrote:

 Sorry for that last message, I was confused because I thought I needed to
 use the DseSimpleSnitch, but of course I can use the PropertyFileSnitch, and
 that lets me set up the three-data-center configuration explained earlier.
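
 For reference, a minimal cassandra-topology.properties along those lines might
 look like this (the IPs and rack names are placeholders, only the DC/rack
 mapping format matters):

   # DC1: the three ingest nodes
   192.168.1.1=DC1:RAC1
   192.168.1.2=DC1:RAC1
   192.168.1.3=DC1:RAC1
   # DC2: the three read-serving nodes
   192.168.2.4=DC2:RAC1
   192.168.2.5=DC2:RAC1
   192.168.2.6=DC2:RAC1
   # DC3: the three analytics nodes
   192.168.3.7=DC3:RAC1
   192.168.3.8=DC3:RAC1
   192.168.3.9=DC3:RAC1
   # anything not listed falls back to:
   default=DC1:RAC1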

 Cheers,
 Alex


 On Thu, Mar 15, 2012 at 10:56 AM, Alexandru Sicoe adsi...@gmail.com wrote:

 Thanks Tyler,
  I see that cassandra.yaml has endpoint_snitch:
 com.datastax.bdp.snitch.DseSimpleSnitch. Will this pick up the
 configuration from the cassandra-topology.properties file as does the
 PropertyFileSnitch? Or is there some other way of telling it which nodes
 are in which DC?

 Cheers,
 Alex


 On Wed, Mar 14, 2012 at 9:09 PM, Tyler Hobbs ty...@datastax.com wrote:

 Yes, you can do this.

 You will want to have three DCs: DC1 with [1, 2, 3], DC2 with [4, 5, 6],
 and DC3 with [7, 8, 9].  For your normal data keyspace, the replication
 strategy should be NTS, and the strategy_options should have some replicas
 in each of the three DCs.  For example: {DC1: 3, DC2: 3, DC3: 3} if you
 need that level of replication in each one (although you probably only want
 an RF of 1 for DC3).

 Your clients that are performing writes should only open connections
 against the nodes in DC1, and you should write at CL.ONE or
 CL.LOCAL_QUORUM.  Likewise for reads, your clients should only connect to
 nodes in DC2, and you should read at CL.ONE or CL.LOCAL_QUORUM.

 The nodes in DC3 should run as analytics nodes.  I believe the default
 CL for m/r jobs is ONE, which would work.

 As far as tokens go, interleaving all three DCs and evenly spacing the
 tokens will work.  For example, the ordering of your nodes might be [1, 4,
 7, 2, 5, 8, 3, 6, 9].
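
 A hedged sketch of a keyspace definition matching the layout above (the
 keyspace name is a placeholder; option syntax per the 1.0-era cassandra-cli):

   create keyspace MyKS
     with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
     and strategy_options = {DC1:3, DC2:3, DC3:1};

 With this, writes sent to DC1 at LOCAL_QUORUM only block on the DC1 replicas;
 the DC2 and DC3 replicas still receive the write, just without the client
 waiting on them.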


 On Wed, Mar 14, 2012 at 12:05 PM, Alexandru Sicoe adsi...@gmail.com wrote:

 Hi everyone,
  I want to test out the Datastax Enterprise software to have a mixed
 workload setup with an analytics and a real time part.

  However I am not sure how to configure it to achieve what I want: I
 will have 3 real machines on one side of a gateway (1,2,3) and 6 VMs on
 another(4,5,6).
  1,2,3 will each have a normal Cassandra node that just takes data
 directly from my data sources. I want them to replicate the data to the
 other 6 VMs. Now, out of those 6 VMs 4,5,6 will run normal Cassandra nodes
 and 7,8,9 will run Analytics nodes. So I only want to write to the 1,2,3
 and I only want to serve user reads from 4,5,6 and do analytics on 7,8,9.
 Can I achieve this by configuring 1,2,3,4,5,6 as normal nodes and the rest
 as analytics nodes? If I alternate the tokens as it's explained in
 http://www.datastax.com/docs/1.0/datastax_enterprise/init_dse_cluster#init-dse
 is it analogous to achieving something like 3 DCs each getting their own
 replica?

 Thanks,
 Alex




 --
 Tyler Hobbs
 DataStax http://datastax.com/






Re: CASSANDRA-2388 - ColumnFamilyRecordReader fails for a given split because a host is down

2012-03-16 Thread Mick Semb Wever
Sorry for such a late reply. I'm not always keeping up with the mailing
list.

 Is the following scenario covered by 2388? I have a test cluster of 6
 nodes with a replication factor of 3. Each server can execute hadoop
 tasks. 1 cassandra node is down for the test.
 
 The job is kicked off from node 1 jobtracker.
 A task is executed from node 1, and fails because the local cassandra
 instance is down
 retry on node 6, this tries to connect to node 1 and fails
 retry on node 5, this tries to connect to node 1 and fails
 retry on node 4, this tries to connect to node 1 and fails
 After 4 failures the task is killed and the job fails.
 
 Node 2 and 3 which contain the other replicas never run the task. The
 node selection seems to be random. I can modify the cassandra code to
 check connectivity in ColumnFamilyRecordReader but I suspect this is
 fixing the wrong problem.

There are two problems here.

1) hadoop's jobtracker isn't preferencing tasks to tasktrackers that
would provide data locality.

2) connections to replica nodes are never attempted directly; instead the
task must fail and be re-submitted to another tasktracker which
hopefully is a replica node.

 [snip] but this comment from mck seems to say it should work
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%
 3C1315253057.7466.222.camel@localhost%3E

not in your case. 
ColumnFamilyInputFormat splits the query into InputSplits. This is done
via the api calls describe_ring and describe_splits. These InputSplits
(ColumnFamilySplit) each has a list of locations which are the replica
nodes.

Now hadoop is supposed to preference sending tasks to tasktrackers based
on the split's location. This is problem (1). I haven't seen it actually
work. The closest information I got is
http://abel-perez.com/hadoop-task-assignment

Problem (2) is ColumnFamilyRecordReader.getLocation() returns you the
address from the list of locations for the current split that matches
the localhost. This preferences data locality. If none of the locations
is local then it simply returns the first location in the list. This
explains your use case not working. One fix for you to experiment with
is to increase the allowed task failures (I think it is
mapred.max.tracker.failures) to the number of nodes you have. Then each
node would be (randomly) tried before the task is killed and the job fails.
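
As a hedged example of bumping that limit per job (assuming a Hadoop 1.x job
launched through ToolRunner so the generic -D option is honoured; the jar and
class names are placeholders, and I believe mapred.map.max.attempts is the
per-task attempt limit whose default of 4 matches the four failures described
above, while mapred.max.tracker.failures governs per-tasktracker blacklisting):

  hadoop jar my-cassandra-job.jar com.example.MyJob \
    -D mapred.map.max.attempts=6 \
    <other job arguments>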

~mck


-- 
Friendship with the upright, with the truthful and with the well
informed is beneficial. Friendship with those who flatter, with those
who are meek and who compromise with principles, and with those who talk
cleverly is harmful. Confucius 

| http://github.com/finn-no | http://tech.finn.no |





Re: Single Node Cassandra Installation

2012-03-16 Thread Thomas van Neerijnen
You'll need to either read or write at at least quorum to get consistent
data from the cluster so you may as well do both.
Now that you mention it, I was wrong about downtime, with a two node
cluster reads or writes at quorum will mean both nodes need to be online.
Perhaps you could have an emergency switch in your application which flips
to consistency of 1 if one of your Cassandra servers goes down? Just make
sure it's set back to quorum when the second one returns or again you could
end up with inconsistent data.
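To make the arithmetic concrete: with RF=2, QUORUM is floor(2/2) + 1 = 2, so
every quorum read or write needs both nodes up. Dropping to consistency level
ONE keeps the cluster usable with one node down, but then reads + writes = 2,
which is not greater than RF, so a read is no longer guaranteed to see the
latest write until the downed node catches up.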

On Fri, Mar 16, 2012 at 2:04 AM, Drew Kutcharian d...@venarc.com wrote:

 Thanks for the comments, I guess I will end up doing a 2 node cluster with
 replica count 2 and read consistency 1.

 -- Drew



 On Mar 15, 2012, at 4:20 PM, Thomas van Neerijnen wrote:

 So long as data loss and downtime are acceptable risks a one node cluster
 is fine.
 Personally this is usually only acceptable on my workstation, even my dev
 environment is redundant, because servers fail, usually when you least want
 them to, like for example when you've decided to save costs by waiting
 before implementing redundancy. Could a failure end up costing you more
 than you've saved? I'd rather get cheaper servers (maybe even used off
 ebay??) so I could have at least two of them.

 If you do go with a one node solution, altho I haven't tried it myself
 Priam looks like a good place to start for backups, otherwise roll your own
 with incremental snapshotting turned on and a watch on the snapshot
 directory. Storage on something like S3 or Cloud Files is very cheap so
 there's no good excuse for no backups.

 On Thu, Mar 15, 2012 at 7:12 PM, R. Verlangen ro...@us2.nl wrote:

 Hi Drew,

 One other disadvantage is the lack of consistency level and
 replication. Both are part of the high availability / redundancy. So you
 would really need to backup your single-node-cluster to some other
 external location.

 Good luck!


 2012/3/15 Drew Kutcharian d...@venarc.com

 Hi,

 We are working on a project that initially is going to have very little
 data, but we would like to use Cassandra to ease the future scalability.
 Due to budget constraints, we were thinking to run a single node Cassandra
 for now and then add more nodes as required.

 I was wondering if it is recommended to run a single node cassandra in
 production? Are there any other issues besides lack of high availability?

 Thanks,

 Drew







Re: 1.0.8 with Leveled compaction - Possible issues

2012-03-16 Thread Johan Elmerfjord
Perfect.. this helped a lot - and I can confirm that I have run in to
the same issue as described in:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201203.mbox/%
3CCALqbeQbQ=d-hORVhA-LHOo_a5j46fQrsZMm+OQgfkgR=4rr...@mail.gmail.com%3E

Where it goes down when it tries to move up files to a higher level -
that is out of bounds.

Nice that I could get an overview of the levels by looking in
the .json-file as well.

Any timeframe on when we can expect 1.0.9 to be released?

/Johan


-- 
  
Johan Elmerfjord | Sr. Systems Administration/Mgr, EMEA | Adobe Systems,
Product Technical Operations | p. +45 3231 6008 | x86008 | cell. +46 735
101 444 | jelme...@adobe.com 

On Thu, 2012-03-15 at 17:00 -0700, Watanabe Maki wrote:
 update column family with LCS option + upgradesstables should convert
 all of your sstables.
 Set the log4j config:
 org.apache.cassandra.db.compaction=DEBUG
 in conf/log4j-server.properties and retry your procedure to see what
 is happening.
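
 Concretely, that amounts to one line in conf/log4j-server.properties
 (standard log4j logger syntax):

   log4j.logger.org.apache.cassandra.db.compaction=DEBUG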
 
 
 maki
 
 
 
 On 2012/03/16, at 7:05, Johan Elmerfjord jelme...@adobe.com wrote:
 
 
 
  Hi, I'm testing the community-version of Cassandra 1.0.8.
  We are currently on 0.8.7 in our production-setup.
  
  We have 3 Column Families that each takes between 20 and 35 GB on
  disk per node. (8*2 nodes total)
  We would like to change to Leveled Compaction - and even try
  compression as well to reduce the space needed for compactions.
  We are running on SSD-drives as latency is a key-issue.
  
  As a test I have imported one Column Family from 3 production-nodes to
  a 3 node test-cluster.
  The data on the 3 nodes ranges from 19-33GB. (with at least one
  large SSTable (Tiered size - recently compacted)).
  
  After loading this data to the 3 test-nodes, and running scrub and
  repair, I took a backup of the data so I have a good test set of data
  to work on.
  Then I changed to leveled compaction, using the
  cassandra-cli:
  
  UPDATE COLUMN FAMILY TestCF1 WITH
  compaction_strategy=LeveledCompactionStrategy;
  I could see the change being written to the logfile on all nodes.
  
  Then I don't know for sure if I need to run anything else to
  make the change happen - or if it's just to wait.
  My test-cluster does not receive new data.
  
  For this KS & CF and on each of the nodes I have tried some or
  several of: upgradesstables, scrub, compact, cleanup and repair -
  each task taking between 40 minutes and 4 hours.
  With the exception of compact that returns almost immediately with
  no visible compactions made.
  
  On some node I ended up with over 3 files with the default 5MB
  size for leveled compaction, on another node it didn't look like
  anything has been done and I still have a 19GB SSTable.
  
  I then made another change.
  UPDATE COLUMN FAMILY TestCF1 WITH
  compaction_strategy=LeveledCompactionStrategy AND
  compaction_strategy_options=[{sstable_size_in_mb: 64}];
  WARNING: [{}] strategy_options syntax is deprecated, please use {}
  Which is probably wrong in the documentation - and should be:
  UPDATE COLUMN FAMILY TestCF1 WITH
  compaction_strategy=LeveledCompactionStrategy AND
  compaction_strategy_options={sstable_size_in_mb: 64};
  
  I think that we will be able to find the data in 3 searches with a
  64MB size - and still only use around 700MB while doing compactions
  - and keep the number of files ~3000 per CF.
  
  A few days later it looks like I still have a mix between the original
  huge SSTables, 5MB ones - and some nodes have 64MB files as well.
  Do I need to do something special to clean this up?
  I have tried another scrub / upgradesstables / cleanup - but nothing
  seems to change anything.
  
  Finally I have also tried to enable compression:
  UPDATE COLUMN FAMILY TestCF1 WITH
  compression_options=[{sstable_compression:SnappyCompressor,
  chunk_length_kb:64}];
  - which results in the same [{}] - warning.
  
  As you can see below - this created CompressionInfo.db - files on
  some nodes - but not on all.
  
  Is there a way I can force Tiered sstables to be converted into
  Leveled ones - and then to compression as well?
  Why are the original files (Tiered-sized SSTables) still present on
  testnode1 - when is it supposed to delete them?
  
  Can I change back and forth between compression (on/off - or
  chunksizes) - and between Leveled vs Size Tiered compaction?
  Is there a way to see if the node is done - or waiting for
  something?
  When is it safe to apply another setting - does it have to complete
  one reorg before moving on to the next?
  
  Any input or own experiences are warmly welcome.
  
  Best regards, Johan
  
  
  Some lines of example directory-listings below.:
  
  Some files for testnode 3 (looks like it still has the original
  Size Tiered files around, and a mixture of compressed 64MB files
  and 5MB files?)
  
  total 19G
  drwxr-xr-x 3 cass cass 4.0K Mar 13 17:11 snapshots
  -rw-r--r-- 1 cass cass 6.0G Mar 13 18:42 TestCF1-hc-6346-Index.db
  -rw-r--r-- 1 cass cass 1.3M 

RE: Large hints column family

2012-03-16 Thread Bryce Godfrey
I took the reset the world approach, things are much better now and the hints 
table is staying empty.  Bit disconcerting that it could get so large and not 
be able to recover itself, but at least there was a solution.  Thanks


From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Thursday, March 15, 2012 7:24 PM
To: user@cassandra.apache.org
Subject: Re: Large hints column family

These messages make it look like the node is having trouble delivering hints.
INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 
284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 
354) Timed out replaying hints to /192.168.20.3; aborting further deliveries

Take another look at the logs on this machine and on 20.4 and 20.3.

I would be looking into why so many hints are being stored. GC? Are there also 
logs about dropped messages?

If you want to reset the world, make sure the nodes have all run repair and 
then drop the hints, either via JMX or by stopping the node and deleting the 
files on disk.
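
For the delete-on-disk variant, the hints are just the HintsColumnFamily
sstables under the system keyspace; assuming the default data directory
(check data_file_directories in cassandra.yaml), it is roughly:

  # with cassandra stopped on the node:
  rm /var/lib/cassandra/data/system/HintsColumnFamily-*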

Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 12:58 PM, Bryce Godfrey wrote:


We were having some occasional memory pressure issues, but we just added some 
more RAM a few days ago to the nodes and things are running more smoothly now, 
but in general nodes have not been going up and down.

I tried to do a list HintsColumnFamily from Cassandra-cli and it locks my 
Cassandra node and never returns, forcing me to kill the Cassandra process and 
restart it to get the node back.

Here are my settings, which I believe are the defaults since I don't remember 
changing them:

hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50

Grepping for Hinted in the system log I get these:
INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 
373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 
284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [ScheduledTasks:1] 2012-03-13 16:15:32,569 StatusLogger.java (line 65) 
HintedHandoff 1 1 0
INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 
296) Started hinted handoff for token: 113427455640312814857969558651062452224 
with IP: /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:21:37,266 HintedHandOffManager.java (line 
296) Started hinted handoff for token: 113427455640312814857969558651062452224 
with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-13 16:23:07,662 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:25:49,330 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:30:52,503 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:42:22,202 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 
354) Timed out replaying hints to /192.168.20.3; aborting further deliveries
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 ColumnFamilyStore.java (line 
704) Enqueuing flush of Memtable-HintsColumnFamily@661547256(34298224/74465815 
serialized/live bytes, 78808 ops)
INFO [HintedHandoff:1] 2012-03-13 17:11:00,098 HintedHandOffManager.java (line 
373) Finished hinted handoff of 44160 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 17:11:36,596 HintedHandOffManager.java (line 
296) Started hinted handoff for token: 56713727820156407428984779325531226112 
with IP: /192.168.20.4
INFO [ScheduledTasks:1] 2012-03-13 17:12:25,248 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-13 18:47:56,151 HintedHandOffManager.java (line 
296) Started hinted handoff for token: 113427455640312814857969558651062452224 
with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-13 18:50:24,326 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:12:48,177 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:13:57,685 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:14:57,258 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:14:58,260 StatusLogger.java (line 65) 
HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-14 12:15:59,093 StatusLogger.java (line 65) 
HintedHandoff   

Order rows numerically

2012-03-16 Thread A J
If I define my rowkeys to be Integer
(key_validation_class=IntegerType) , how can I order the rows
numerically ?
ByteOrderedPartitioner orders lexically, and retrieval using get_range
does not seem to return results in an order that makes sense.

If I were to change the rowkey to be UTF8 (key_validation_class=UTF8Type),
BOP still does not give numerical ordering.
For a range of rowkeys from 1 to 2, I get 1, 10, 11, ..., 2 (lexical ordering).

Any workaround for this ?

Thanks.


RE: 0.8.1 Vs 1.0.7

2012-03-16 Thread Jeremiah Jordan
I would guess more aggressive compaction settings, did you update rows or 
insert some twice?
If you run major compaction a couple times on the 0.8.1 cluster does the data 
size get smaller?

You can use the describe command to check if compression got turned on.
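
For example (host, keyspace and column family names are placeholders):

  # force a major compaction, then re-check the reported load:
  nodetool -h <host> compact <Keyspace> <ColumnFamily>

  -- in cassandra-cli, the column family description lists the compression options, if any:
  use <Keyspace>;
  describe;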

-Jeremiah


From: Ravikumar Govindarajan [ravikumar.govindara...@gmail.com]
Sent: Thursday, March 15, 2012 4:41 AM
To: user@cassandra.apache.org
Subject: 0.8.1 Vs 1.0.7

Hi,

I ran some data import tests for cassandra 0.8.1 and 1.0.7. The results were a 
little bit surprising

0.8.1, SimpleStrategy, Rep_Factor=3,QUORUM Writes, RP, SimpleSnitch

XXX.XXX.XXX.A  datacenter1 rack1   Up Normal  140.61 GB   12.50%
XXX.XXX.XXX.B  datacenter1 rack1   Up Normal  139.92 GB   12.50%
XXX.XXX.XXX.C  datacenter1 rack1   Up Normal  138.81 GB   12.50%
XXX.XXX.XXX.D  datacenter1 rack1   Up Normal  139.78 GB   12.50%
XXX.XXX.XXX.E  datacenter1 rack1   Up Normal  137.44 GB   12.50%
XXX.XXX.XXX.F  datacenter1 rack1   Up Normal  138.48 GB   12.50%
XXX.XXX.XXX.G  datacenter1 rack1   Up Normal  140.52 GB   12.50%
XXX.XXX.XXX.H  datacenter1 rack1   Up Normal  145.24 GB   12.50%

1.0.7, NTS, Rep_Factor{DC1:3, DC2:2}, LOCAL_QUORUM writes, RP [DC2 m/c yet to 
join ring],
PropertyFileSnitch

XXX.XXX.XXX.A  DC1 RAC1   Up Normal   48.72 GB   12.50%
XXX.XXX.XXX.B  DC1 RAC1   Up Normal   51.23 GB   12.50%
XXX.XXX.XXX.C  DC1 RAC1   Up Normal   52.4 GB    12.50%
XXX.XXX.XXX.D  DC1 RAC1   Up Normal   49.64 GB   12.50%
XXX.XXX.XXX.E  DC1 RAC1   Up Normal   48.5 GB    12.50%
XXX.XXX.XXX.F  DC1 RAC1   Up Normal   53.38 GB   12.50%
XXX.XXX.XXX.G  DC1 RAC1   Up Normal   51.11 GB   12.50%
XXX.XXX.XXX.H  DC1 RAC1   Up Normal   53.36 GB   12.50%

There seems to be 3X savings in size for the same dataset running 1.0.7. I have 
not enabled compression for any of the CFs. Will it be enabled by default when 
creating a new CF in 1.0.7? cassandra.yaml is also mostly identical.

Thanks and Regards,
Ravi


Re: Single Node Cassandra Installation

2012-03-16 Thread Ben Coverston
Doing reads and writes at CL=1 with RF=2 N=2 does not imply that the reads
will be inconsistent. It's more complicated than the simple counting of
blocked replicas. It is easy to support the notion that it will be largely
consistent, in fact very consistent for most use cases.

By default Cassandra tries to write to both nodes, always. Writes will only
fail (on a node) if it is down, and even then hinted handoff will attempt
to keep both nodes in sync when the troubled node comes back up. The point
of having two nodes is to have read and write availability in the face of
transient failure.

If you are interested there is a good exposition of what 'consistency'
means in a system like Cassandra from the link below[1].

[1]
http://www.eecs.berkeley.edu/~pbailis/projects/pbs/


On Fri, Mar 16, 2012 at 6:50 AM, Thomas van Neerijnen
t...@bossastudios.com wrote:

 You'll need to either read or write at at least quorum to get consistent
 data from the cluster so you may as well do both.
 Now that you mention it, I was wrong about downtime, with a two node
 cluster reads or writes at quorum will mean both nodes need to be online.
 Perhaps you could have an emergency switch in your application which flips
 to consistency of 1 if one of your Cassandra servers goes down? Just make
 sure it's set back to quorum when the second one returns or again you could
 end up with inconsistent data.


 On Fri, Mar 16, 2012 at 2:04 AM, Drew Kutcharian d...@venarc.com wrote:

 Thanks for the comments, I guess I will end up doing a 2 node cluster
 with replica count 2 and read consistency 1.

 -- Drew



 On Mar 15, 2012, at 4:20 PM, Thomas van Neerijnen wrote:

 So long as data loss and downtime are acceptable risks a one node cluster
 is fine.
 Personally this is usually only acceptable on my workstation, even my dev
 environment is redundant, because servers fail, usually when you least want
 them to, like for example when you've decided to save costs by waiting
 before implementing redundancy. Could a failure end up costing you more
 than you've saved? I'd rather get cheaper servers (maybe even used off
 ebay??) so I could have at least two of them.

 If you do go with a one node solution, altho I haven't tried it myself
 Priam looks like a good place to start for backups, otherwise roll your own
 with incremental snapshotting turned on and a watch on the snapshot
 directory. Storage on something like S3 or Cloud Files is very cheap so
 there's no good excuse for no backups.

 On Thu, Mar 15, 2012 at 7:12 PM, R. Verlangen ro...@us2.nl wrote:

 Hi Drew,

 One other disadvantage is the lack of consistency level and
 replication. Both are part of the high availability / redundancy. So you
 would really need to backup your single-node-cluster to some other
 external location.

 Good luck!


 2012/3/15 Drew Kutcharian d...@venarc.com

 Hi,

 We are working on a project that initially is going to have very little
 data, but we would like to use Cassandra to ease the future scalability.
 Due to budget constraints, we were thinking to run a single node Cassandra
 for now and then add more nodes as required.

 I was wondering if it is recommended to run a single node cassandra in
 production? Are there any other issues besides lack of high availability?

 Thanks,

 Drew








-- 
Ben Coverston
DataStax -- The Apache Cassandra Company


cassandra-cli and unreachable status confusion

2012-03-16 Thread Shoaib Mir
Hi guys,

While creating schema on our cluster today I didn't get any errors even
when some of the hosts in the cluster were unreachable (not the ones in the
same data centre but in another region). The cli kept on showing that all
nodes were agreeing.

Now after this when I did describe cluster I did get appropriate
unreachable messages for the nodes that were timing out on connections.

Can someone please explain whether, at the time of schema creation, a node
just talks to the other nodes within its DC for agreement, or whether it has
to talk to each and every node in the whole cluster before agreeing on schema
changes?

cheers,
Shoaib


Re: 1.0.8 with Leveled compaction - Possible issues

2012-03-16 Thread Watanabe Maki
The Cassandra team has released a new version every month for the last half
year, so I anticipate they will release 1.0.9 before April. Just my forecast :-)

maki


On 2012/03/16, at 22:41, Johan Elmerfjord jelme...@adobe.com wrote:

 Perfect.. this helped a lot - and I can confirm that I have run in to the 
 same issue as described in:
 http://mail-archives.apache.org/mod_mbox/cassandra-user/201203.mbox/%3CCALqbeQbQ=d-hORVhA-LHOo_a5j46fQrsZMm+OQgfkgR=4rr...@mail.gmail.com%3E
 
 Where it goes down when it tries to move up files to a higher level - that is 
 out of bounds.
 
 Nice that I could get an overview of the levels by looking in the .json-file 
 as well.
 
 Any timeframe on when we can expect 1.0.9 to be released?
 
 /Johan
 
 
 -- 
   
 Johan Elmerfjord | Sr. Systems Administration/Mgr, EMEA | Adobe Systems, 
 Product Technical Operations | p. +45 3231 6008 | x86008 | cell. +46 735 101 
 444 | jelme...@adobe.com 
 
 On Thu, 2012-03-15 at 17:00 -0700, Watanabe Maki wrote:
 
 update column family with LCS option + upgradesstables should convert all of 
 your sstables.
 Set the log4j config:
 org.apache.cassandra.db.compaction=DEBUG
 in conf/log4j-server.properties and retry your procedure to see what is 
 happening.
 
 
 maki
 
 
 
 On 2012/03/16, at 7:05, Johan Elmerfjord jelme...@adobe.com wrote:
 
 
 
 Hi, I'm testing the community-version of Cassandra 1.0.8.
 We are currently on 0.8.7 in our production-setup.
 
 We have 3 Column Families that each takes between 20 and 35 GB on disk per 
 node. (8*2 nodes total)
 We would like to change to Leveled Compaction - and even try compression as 
 well to reduce the space needed for compactions.
 We are running on SSD-drives as latency is a key-issue.
 
 As a test I have imported one Column Family from 3 production-nodes to a 3 
 node test-cluster.
 The data on the 3 nodes ranges from 19-33GB. (with at least one large 
 SSTable (Tiered size - recently compacted)).
 
 After loading this data to the 3 test-nodes, and running scrub and repair, 
 I took a backup of the data so I have a good test set of data to work on.
 Then I changed to leveled compaction, using the cassandra-cli:
 
 UPDATE COLUMN FAMILY TestCF1 WITH 
 compaction_strategy=LeveledCompactionStrategy;
 I could see the change being written to the logfile on all nodes.
 
 Then I don't know for sure if I need to run anything else to make the 
 change happen - or if it's just to wait.
 My test-cluster does not receive new data.
 
 For this KS & CF and on each of the nodes I have tried some or several of: 
 upgradesstables, scrub, compact, cleanup and repair - each task taking 
 between 40 minutes and 4 hours.
 With the exception of compact that returns almost immediately with no 
 visible compactions made.
 
 On some node I ended up with over 3 files with the default 5MB size for 
 leveled compaction, on another node it didn't look like anything has been 
 done and I still have a 19GB SSTable.
 
 I then made another change.
 UPDATE COLUMN FAMILY TestCF1 WITH 
 compaction_strategy=LeveledCompactionStrategy AND 
 compaction_strategy_options=[{sstable_size_in_mb: 64}];
 WARNING: [{}] strategy_options syntax is deprecated, please use {}
 Which is probably wrong in the documentation - and should be:
 UPDATE COLUMN FAMILY TestCF1 WITH 
 compaction_strategy=LeveledCompactionStrategy AND 
 compaction_strategy_options={sstable_size_in_mb: 64};
 
 I think that we will be able to find the data in 3 searches with a 64MB 
 size - and still only use around 700MB while doing compactions - and keep 
 the number of files ~3000 per CF.
 
 A few days later it looks like I still have a mix between the original huge 
 SSTables, 5MB ones - and some nodes have 64MB files as well.
 Do I need to do something special to clean this up?
 I have tried another scrub / upgradesstables / cleanup - but nothing seems to 
 change anything.
 
 Finally I have also tried to enable compression:
 UPDATE COLUMN FAMILY TestCF1 WITH 
 compression_options=[{sstable_compression:SnappyCompressor, 
 chunk_length_kb:64}];
 - which results in the same [{}] - warning.
 
 As you can see below - this created CompressionInfo.db - files on some 
 nodes - but not on all.
 
 Is there a way I can force Tiered sstables to be converted into Leveled 
 ones - and then to compression as well?
 Why are the original files (Tiered-sized SSTables) still present on testnode1 
 - when is it supposed to delete them?
 
 Can I change back and forth between compression (on/off - or chunksizes) - 
 and between Leveled vs Size Tiered compaction?
 Is there a way to see if the node is done - or waiting for something?
 When is it safe to apply another setting - does it have to complete one 
 reorg before moving on to the next?
 
 Any input or own experiences are warmly welcome.
 
 Best regards, Johan
 
 
 Some lines of example directory-listings below.:
 
 Some files for testnode 3 (looks like it still has the original Size 
 Tiered files around, and a mixture of compressed 64MB 

Re: Order rows numerically

2012-03-16 Thread Watanabe Maki
How about zero-padding the keys to a fixed number of digits?
Ex. 0001, 0002, etc
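For example, padded to a fixed width of four digits, the keys 1, 2 and 10
become 0001, 0002 and 0010, so a lexical range scan from 0001 to 0002 no
longer picks up 0010.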

maki


On 2012/03/17, at 6:29, A J s5a...@gmail.com wrote:

 If I define my rowkeys to be Integer
 (key_validation_class=IntegerType) , how can I order the rows
 numerically ?
  ByteOrderedPartitioner orders lexically, and retrieval using get_range
  does not seem to return results in an order that makes sense.
 
 If I were to change rowkey to be UTF8 (key_validation_class=UTF8Type),
  BOP still does not give numerical ordering.
  For a range of rowkeys from 1 to 2, I get 1, 10, 11, ..., 2 (lexical ordering).
 
 Any workaround for this ?
 
 Thanks.


Re: Order rows numerically

2012-03-16 Thread Dave Brosius
if your keys are 1-n and you are using BOP, then almost certainly your 
ring will be massively unbalanced with the first node getting clobbered. 
You'll have bigger issues than getting lexical ordering.


I'd try to rethink your design so that you don't need BOP.

On 03/16/2012 06:49 PM, Watanabe Maki wrote:

How about zero-padding the keys to a fixed number of digits?
Ex. 0001, 0002, etc

maki


On 2012/03/17, at 6:29, A J s5a...@gmail.com wrote:


If I define my rowkeys to be Integer
(key_validation_class=IntegerType) , how can I order the rows
numerically ?
ByteOrderedPartitioner orders lexically, and retrieval using get_range
does not seem to return results in an order that makes sense.

If I were to change rowkey to be UTF8 (key_validation_class=UTF8Type),
BOP still does not give numerical ordering.
For a range of rowkeys from 1 to 2, I get 1, 10, 11, ..., 2 (lexical ordering).

Any workaround for this ?

Thanks.




Re: Question regarding secondary indices

2012-03-16 Thread Sanjeev Kulkarni
Thanks Aaron for the response. I see those logs.
I had one more question. Looks like sstableloader takes only one directory
at a time. Is it possible to load multiple directories in one call.
Something like sstableloader /drive1/keyspace1 /drive2/keyspace1...
This way one can take adv of the speedup that you get from reading accross
multiple drives.
Or alternatively is it possible to run multiple instances of sstableloader
on the same machine concurrently?
Thanks!

 On Thu, Mar 15, 2012 at 6:54 PM, aaron morton aa...@thelastpickle.com wrote:

 You should see a log line with Index build of {} complete.

 You can also see which indexes are built using the describe command in
 cassandra-cli.
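
 For example, a quick way to watch for that log line from the shell (the log
 path is assumed to be the default packaged location):

   grep "Index build of" /var/log/cassandra/system.log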


 [default@XX] describe;
 Keyspace: XX:
 ...
   Column Families:
     ColumnFamily: XXX
 ...
       Built indexes: []

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 16/03/2012, at 10:04 AM, Sanjeev Kulkarni wrote:

 Hi,
 I'm using a 4 node cassandra cluster running 0.8.10 with rf=3. It's a brand
 new setup.
 I have a single col family which contains about 10 columns. I have enabled
 secondary indices on 3 of them. I used sstableloader to bulk load some data
 into this cluster.
 I poked around the logs and saw the following messages
 Submitting index build of attr_001 ..
 which indicates that cassandra has started building indices.
 How will I know when the building of the indices is done? Is there some
 log messages that I should look for?
 Thanks!