Re: Heavy writes ok for single node, but failed for cluster

2011-04-28 Thread Sheng Chen
Thank you for your advice. RF=2 is a good workaround.
I was using 0.7.4 and have updated to the latest 0.7 branch, which includes
the 2554 patch.
But it doesn't help. I still get lots of UnavailableExceptions after the
following log entries:

 INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228)
InetAddress /192.168.125.49 is now dead.
 INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609)
InetAddress /192.168.125.49 is now UP

 INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java
(line 304) Started hinted handoff for endpoint /192.168.125.49
 INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java
(line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49

It seems that the gossip failure detection is too sensitive. Is there any
configuration to tune it?
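For what it's worth, the failure detector's sensitivity is governed by
phi_convict_threshold, which later Cassandra releases expose in
cassandra.yaml; whether a given 0.7 build exposes it is an assumption worth
checking against your conf. A minimal sketch:

 # conf/cassandra.yaml
 # raise phi to make the detector slower to declare a node dead
 # (the shipped default is 8)
 phi_convict_threshold: 12

A higher value trades slower detection of genuinely dead nodes for fewer
false "is now dead" flaps like the one above.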

2011/4/27 Sylvain Lebresne sylv...@datastax.com

 On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen chensheng2...@gmail.com
 wrote:
  I succeeded in inserting 1 billion records into a single-node Cassandra:
  bin/stress -d cas01 -o insert -n 10 -c 5 -S 34 -C5 -t 20
  Inserts finished in about 14 hours at a speed of 20k/sec.
  But when I added another node, tests always failed with UnavailableException
  within an hour.
  bin/stress -d cas01,cas02 -o insert -n 10 -c 5 -S 34 -C5 -t 20
  Write speed is also 20k/sec because of the bottleneck in the client, so the
  pressure on each server node should be 50% of the single-node test.
  Why couldn't they handle it?
  By default, rf=1, consistency=ONE
  Some information that may be helpful,
  1. no warn/error in the log file; the cluster is still alive after those
  exceptions
  2. the last logs on both nodes happen to be compaction-complete messages
  3. gossip log shows one node is dead and then up again in 3 seconds

 That's your problem. Once a node is marked down (and since rf=1), when an
 update for cas02 reaches cas01 while cas01 still has cas02 marked down,
 cas01 will throw an UnavailableException.

 Now, it shouldn't have been marked down, and I suspect this is due to
 https://issues.apache.org/jira/browse/CASSANDRA-2554
 (even though you didn't say which version you're using, I suppose
 this is a 0.7.*).

 If you apply this patch or use the current svn 0.7 branch, that should
 hopefully not happen again.

 Note that if you had rf=2, the node would still have been wrongly marked
 down for 3 seconds, but that would have been transparent to the stress
 test.

  4. I set hinted_handoff_enabled: false, but still see lots of handoff
 logs

 What are those saying?

 --
 Sylvain



Re: Heavy writes ok for single node, but failed for cluster

2011-04-28 Thread Sheng Chen
   n/a  4067
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 82)
MessagingServicen/a   0,0
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,863 StatusLogger.java (line 86)
ColumnFamilyMemtable ops,data  Row cache size/cap  Key cache
size/cap
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89)
Keyspace1.Super1  0,0 0/0
 0/20
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89)
Keyspace1.Standard1   410835,20744126 0/0
 36172/20
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89)
system.IndexInfo  0,0 0/0
  0/1
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,864 StatusLogger.java (line 89)
system.LocationInfo  1,20 0/0
  2/2
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89)
system.Migrations 0,0 0/0
  0/2
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89)
system.HintsColumnFamily   206378,8667876 0/0
 0/26
 INFO [ScheduledTasks:1] 2011-04-29 07:09:59,865 StatusLogger.java (line 89)
system.Schema 0,0 0/0
  0/2
 INFO [ScheduledTasks:1] 2011-04-29 07:10:06,872 GCInspector.java (line 128)
GC for ParNew: 219 ms, 589477312 reclaimed leaving 4204479728 used; max is
8466202624
--

 INFO [GossipTasks:1] 2011-04-29 07:03:43,262 Gossiper.java (line 228)
InetAddress /192.168.125.51 is now dead.
 INFO [GossipStage:1] 2011-04-29 07:03:43,270 Gossiper.java (line 609)
InetAddress /192.168.125.51 is now UP
 --
 INFO [FlushWriter:1] 2011-04-29 07:03:41,061 Memtable.java (line 157)
Writing Memtable-Standard1@595540423(61557029 bytes, 1218975 operations)
 INFO [GossipStage:1] 2011-04-29 07:03:43,538 Gossiper.java (line 609)
InetAddress /192.168.125.49 is now UP
 INFO [FlushWriter:1] 2011-04-29 07:03:46,550 Memtable.java (line 172)
Completed flushing /data/cassandra/data/Keyspace1/Standard1-f-3991-Data.db
(76184729 bytes)



2011/4/28 Jonathan Ellis jbel...@gmail.com

 This means a node was too busy with something else to send out its
 heartbeat. Sometimes this is a stop-the-world (STW) GC pause. Other times
 it is a bug (one was fixed for 0.7.6 in
 https://issues.apache.org/jira/browse/CASSANDRA-2554).
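 If it is GC, heap sizing is the usual first lever. A sketch, assuming the
 stock 0.7 conf/cassandra-env.sh layout (the values are illustrative, not a
 recommendation):

 # conf/cassandra-env.sh
 MAX_HEAP_SIZE="8G"   # keep the heap well inside physical RAM
 HEAP_NEWSIZE="800M"  # smaller new gen -> shorter, more frequent ParNew pauses

 The GCInspector lines in the log will show whether the pauses actually
 shrink.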

 On Thu, Apr 28, 2011 at 3:57 AM, Sheng Chen chensheng2...@gmail.com
 wrote:
  Thank you for your advice. RF=2 is a good workaround.
  I was using 0.7.4 and have updated to the latest 0.7 branch, which includes
  the 2554 patch.
  But it doesn't help. I still get lots of UnavailableExceptions after the
  following log entries:
   INFO [GossipTasks:1] 2011-04-28 16:12:17,661 Gossiper.java (line 228)
  InetAddress /192.168.125.49 is now dead.
   INFO [GossipStage:1] 2011-04-28 16:12:19,627 Gossiper.java (line 609)
  InetAddress /192.168.125.49 is now UP
   INFO [HintedHandoff:1] 2011-04-28 16:13:11,452 HintedHandOffManager.java
  (line 304) Started hinted handoff for endpoint /192.168.125.49
   INFO [HintedHandoff:1] 2011-04-28 16:13:11,453 HintedHandOffManager.java
  (line 360) Finished hinted handoff of 0 rows to endpoint /192.168.125.49
  It seems that the gossip failure detection is too sensitive. Is there any
  configuration to tune it?
 
  2011/4/27 Sylvain Lebresne sylv...@datastax.com
 
  On Wed, Apr 27, 2011 at 10:32 AM, Sheng Chen chensheng2...@gmail.com
  wrote:
   I succeeded in inserting 1 billion records into a single-node Cassandra:
   bin/stress -d cas01 -o insert -n 10 -c 5 -S 34 -C5 -t 20
   Inserts finished in about 14 hours at a speed of 20k/sec.
   But when I added another node, tests always failed with
   UnavailableException
   within an hour.
   bin/stress -d cas01,cas02 -o insert -n 10 -c 5 -S 34 -C5 -t 20
   Write speed is also 20k/sec because of the bottleneck in the client, so
   the pressure on each server node should be 50% of the single-node test.
   Why couldn't they handle it?
   By default, rf=1, consistency=ONE
   Some information that may be helpful,
   1. no warn/error in the log file; the cluster is still alive after those
   exceptions
   2. the last logs on both nodes happen to be compaction-complete messages
   3. gossip log shows one node is dead and then up again in 3 seconds
 
  That's your problem. Once a node is marked down (and since rf=1), when an
  update for cas02 reaches cas01 while cas01 still has cas02 marked down,
  cas01 will throw an UnavailableException.
 
  Now, it shouldn't have been marked down, and I suspect this is due to
  https://issues.apache.org/jira/browse/CASSANDRA-2554
  (even though you didn't say which version you're using, I suppose
  this is a 0.7.*).
 
  If you apply this patch or use the current svn 0.7 branch, that should
  hopefully not happen again.
 
  Note that if you had rf=2, the node would still have been wrongly marked
  down for 3

Heavy writes ok for single node, but failed for cluster

2011-04-27 Thread Sheng Chen
I succeeded in inserting 1 billion records into a single-node Cassandra:
 bin/stress -d cas01 -o insert -n 10 -c 5 -S 34 -C5 -t 20
Inserts finished in about 14 hours at a speed of 20k/sec.

But when I added another node, tests always failed with UnavailableException
within an hour.
 bin/stress -d cas01,cas02 -o insert -n 10 -c 5 -S 34 -C5 -t 20
Write speed is also 20k/sec because of the bottleneck in the client, so the
pressure on each server node should be 50% of the single-node test.
Why couldn't they handle it?

By default, rf=1, consistency=ONE

Some information that may be helpful,
1. no warn/error in the log file; the cluster is still alive after those
exceptions
2. the last logs on both nodes happen to be compaction-complete messages
3. gossip log shows one node is dead and then up again in 3 seconds
4. I set hinted_handoff_enabled: false, but still see lots of handoff logs


Re: Test idea on cassandra

2011-04-06 Thread Sheng Chen
The stress tools in the contrib directory use multiple threads/processes.

2011/4/7 Mengchen Yu yum...@umail.iu.edu

 I'm trying to simulate a multi-user scenario. The reason why I
 want to use MPJ is to create different processes that act like individual
 users. Does anyone have an idea of how to do this cleanly?
 Sorry for any duplicated mails.



Re: Compaction threshold does not save with nodetool

2011-04-06 Thread Sheng Chen
Thanks.

I think this behavior could be clarified on the wiki.

2011/4/6 Dan Hendry dan.hendry.j...@gmail.com

 There are two layers of settings: the default, cluster-wide settings, which
 are part of the schema and exposed/modifiable via the cli, and the
 individual per-node settings exposed/modifiable via JMX and nodetool. Using
 nodetool, you are only modifying the in-memory settings for a single node;
 changes to those settings are not persisted or reflected on other nodes.



 If you want a particular setting to be persisted across a restart (and
 applied to other nodes when they restart), you have to use the cli and the
 ‘update column family X with min_compaction_threshold=Y and
 max_compaction_threshold=Z’ command.
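 A concrete sketch in cassandra-cli, using the Keyspace1/Standard1 names
 from this thread (threshold values illustrative):

 use Keyspace1;
 update column family Standard1 with min_compaction_threshold=4
     and max_compaction_threshold=32;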



 Dan



 *From:* Sheng Chen [mailto:chensheng2...@gmail.com]
 *Sent:* April-06-11 1:42
 *To:* user@cassandra.apache.org
 *Subject:* Compaction threshold does not save with nodetool



 Cassandra 0.7.4



 # nodetool -h localhost getcompactionthreshold Keyspace1 Standard1

 min=4 max=32

 # nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 0 0

 # nodetool -h localhost getcompactionthreshold Keyspace1 Standard1

 min=0 max=0



 Now the thresholds have changed in the JMX panel, but in the cassandra-cli
 `show keyspaces` output, it is still 4/32.

 After I restart cassandra, the threshold reported by nodetool shows 4/32
 again; the setting is lost.

 I tried nodetool flush to persist the change, but it doesn't work.


Re: Stress tests failed with secondary index

2011-04-06 Thread Sheng Chen
Thank you Aaron.

It does not seem to be an overload problem.

I have 16 cores and 48G of RAM on the single node, and I reduced the
concurrent threads to 1.
Still, it just suddenly dies with a timeout, while CPU, RAM, and disk load
are below 10% and write latency has been about 0.5ms for the past 10
minutes, which is really fast.

No logs of dropped messages are found.

2011/4/7 aaron morton aa...@thelastpickle.com

 TimedOutException means that fewer than CL nodes responded to the
 coordinator before the rpc_timeout.

 So it was overloaded, which makes sense when you say it only happens with
 secondary indexes. Consider things like:
 - reducing the throughput
 - reducing the number of clients
 - ensuring the clients are connecting to all nodes in the cluster.
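 For example, with the stress flags used elsewhere in this digest (node
 names and the thread count here are illustrative):

 bin/stress -d node1,node2,node3 -o insert -t 5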

 You will probably find some logs about dropped messages on some nodes.
 Aaron

 On 6 Apr 2011, at 20:39, Sheng Chen wrote:

  I used the py_stress module to insert 10m rows of test data with a
  secondary index.
  I got the following exceptions.
 
  # python stress.py -d xxx -o insert -n 1000 -c 5 -s 34 -C 5 -x keys
  total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
  265322,26532,26541,0.00186140829433,10
  630300,36497,36502,0.00129331431204,20
  986781,35648,35640,0.0013310986218,30
  1332190,34540,34534,0.00135942295893,40
  1473578,14138,14138,0.00142941070007,50
  Process Inserter-38:
  Traceback (most recent call last):
File /usr/lib64/python2.4/site-packages/multiprocessing/process.py,
 line 237, in _bootstrap
  self.run()
File stress.py, line 242, in run
  self.cclient.batch_mutate(cfmap, consistency)
File
 /root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py,
 line 784, in batch_mutate
  TimedOutException: TimedOutException(args=())
  self.run()
File stress.py, line 242, in run
  self.recv_batch_mutate()
File
 /root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py,
 line 810, in recv_batch_mutate
  raise result.te
 
 
  Tests without the secondary index are ok at about 40k ops/sec.
 
  There is a `GC for ParNew` of about 200ms taking place every second.
  Does it matter?
  A similar GC of about 400ms every 2 seconds happens during the inserts
  without the secondary index, and it does not hurt them.
 
  Thanks in advance for any advice.
 
  Sheng




Compaction threshold does not save with nodetool

2011-04-05 Thread Sheng Chen
Cassandra 0.7.4

# nodetool -h localhost getcompactionthreshold Keyspace1 Standard1
min=4 max=32
# nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 0 0
# nodetool -h localhost getcompactionthreshold Keyspace1 Standard1
min=0 max=0

Now the thresholds have changed in the JMX panel, but in the cassandra-cli
`show keyspaces` output, it is still 4/32.
After I restart cassandra, the threshold reported by nodetool shows 4/32
again; the setting is lost.
I tried nodetool flush to persist the change, but it doesn't work.


Re: Endless minor compactions after heavy inserts

2011-04-03 Thread Sheng Chen
I think if I can keep each sstable file at a proper size, the hot
data/index files may fit into memory, at least on some occasions.

In my use case, I want to use Cassandra for storage of a large amount of log
data.
There will be multiple nodes, and each node has 10*2TB disks to hold as much
data as possible, ideally 20TB (about 100 billion rows) per node.
Read operations will be much rarer than writes. A read latency within
1 second is acceptable.

Is it possible? Do you have advice on this design?
Thank you.

Sheng



2011/4/3 aaron morton aa...@thelastpickle.com

 With only one data file your reads would use the least amount of IO to find
 the data.

 Most people have multiple nodes and probably fewer disks, so each node may
 have a TB or two of data. How much capacity do your 10 disks give? Will you
 be running multiple nodes in production?

 Aaron



 On 2 Apr 2011, at 12:45, Sheng Chen wrote:

 Thank you very much.

 The major compaction will merge everything into one big file, which would
 be very large.
 Is there any way to control the number or size of files created by major
 compaction?
 Or, is there a recommended number or size of files for cassandra to handle?

 Thanks. I see the trigger of my minor compactions is OperationsInMillions.
 It is a total number of operations, which I had thought was per second.

 Cheers,
 Sheng


 2011/4/1 aaron morton aa...@thelastpickle.com

 If you are doing some sort of bulk load you can disable minor compactions
 by setting the min_compaction_threshold and max_compaction_threshold to 0.
 Then, once your insert is complete, run a major compaction via nodetool
 before turning minor compactions back on.
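 As a command sketch (host and keyspace/CF names assumed from this thread):

 # before the bulk load: disable minor compactions
 bin/nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 0 0
 # ...run the inserts, then trigger one major compaction...
 bin/nodetool -h localhost compact
 # finally restore the shipped thresholds
 bin/nodetool -h localhost setcompactionthreshold Keyspace1 Standard1 4 32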

 You can also reduce the compaction threads priority, see
 compaction_thread_priority in the yaml file.

 The memtable will be flushed when either the MB or the ops threshold is
 reached. If you are seeing a lot of memtables smaller than the MB
 threshold, then the ops threshold is probably being triggered. Look for a
 log message at INFO level starting with Enqueuing flush of Memtable; it
 will tell you how many bytes and ops the memtable had when it was flushed.
 Try increasing the ops threshold and see what happens.
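 The two triggers correspond to the MemtableThroughputInMB and
 MemtableOperationsInMillions attributes quoted below. A sketch of raising
 the ops trigger in the cli, where the attribute name and its unit
 (millions of operations) are assumptions from the 0.7 schema; verify with
 `help update column family;` in your build:

 update column family Standard1 with memtable_operations=14;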

 Your change to the compaction threshold may not have had an effect
 because the compaction process was already running.

 AFAIK the best way to get the most out of your 10 disks will be to use a
 dedicated mirror for the commit log and a stripe set for the data.
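 In cassandra.yaml terms that is just the standard directory settings
 (paths illustrative):

 # cassandra.yaml
 commitlog_directory: /mnt/commitlog_mirror/commitlog
 data_file_directories:
     - /data/cassandra/data   # the striped volume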

 Hope that helps.
 Aaron

 On 1 Apr 2011, at 14:52, Sheng Chen wrote:

  I've got a single node of cassandra 0.7.4, and I used the java stress
 tool to insert about 100 million records.
  The inserts took about 6 hours (45k inserts/sec), but the minor compactions
  that followed have lasted for 2 days, and the pending compaction jobs are
  still increasing.
 
  From jconsole I can read the MemtableThroughputInMB=1499,
 MemtableOperationsInMillions=7.0
  But in my data directory, I got hundreds of 438MB data files, which
 should be the cause of the minor compactions.
 
  I tried to set the compaction threshold via nodetool, but it didn't seem
  to take effect (no change in pending compaction tasks).
  After restarting the node, my setting is lost.
 
  I want to distribute the read load across my disks (10 disks in xfs,
  LVM), so I don't want to do a major compaction.
  So, what can I do to keep the sstable files at a reasonable size, or to
  make the minor compactions faster?
 
  Thank you in advance.
  Sheng
 






Re: Endless minor compactions after heavy inserts

2011-04-01 Thread Sheng Chen
Thank you very much.

The major compaction will merge everything into one big file, which would
be very large.
Is there any way to control the number or size of files created by major
compaction?
Or, is there a recommended number or size of files for cassandra to handle?

Thanks. I see the trigger of my minor compactions is OperationsInMillions. It
is a total number of operations, which I had thought was per second.

Cheers,
Sheng


2011/4/1 aaron morton aa...@thelastpickle.com

 If you are doing some sort of bulk load you can disable minor compactions
 by setting the min_compaction_threshold and max_compaction_threshold to 0.
 Then, once your insert is complete, run a major compaction via nodetool
 before turning minor compactions back on.

 You can also reduce the compaction threads priority, see
 compaction_thread_priority in the yaml file.

 The memtable will be flushed when either the MB or the ops threshold is
 reached. If you are seeing a lot of memtables smaller than the MB
 threshold, then the ops threshold is probably being triggered. Look for a
 log message at INFO level starting with Enqueuing flush of Memtable; it
 will tell you how many bytes and ops the memtable had when it was flushed.
 Try increasing the ops threshold and see what happens.

 Your change to the compaction threshold may not have had an effect because
 the compaction process was already running.

 AFAIK the best way to get the most out of your 10 disks will be to use a
 dedicated mirror for the commit log and a stripe set for the data.

 Hope that helps.
 Aaron

 On 1 Apr 2011, at 14:52, Sheng Chen wrote:

  I've got a single node of cassandra 0.7.4, and I used the java stress
 tool to insert about 100 million records.
  The inserts took about 6 hours (45k inserts/sec), but the minor compactions
  that followed have lasted for 2 days, and the pending compaction jobs are
  still increasing.
 
  From jconsole I can read the MemtableThroughputInMB=1499,
 MemtableOperationsInMillions=7.0
  But in my data directory, I got hundreds of 438MB data files, which
 should be the cause of the minor compactions.
 
  I tried to set the compaction threshold via nodetool, but it didn't seem
  to take effect (no change in pending compaction tasks).
  After restarting the node, my setting is lost.
 
  I want to distribute the read load across my disks (10 disks in xfs,
  LVM), so I don't want to do a major compaction.
  So, what can I do to keep the sstable files at a reasonable size, or to
  make the minor compactions faster?
 
  Thank you in advance.
  Sheng
 




Re: newbie question: how do I know the total number of rows of a cf?

2011-03-31 Thread Sheng Chen
I just found an estimateKeys() method on the ColumnFamilyStoreMBean.
Is there any documentation on how it works?
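For a quick look without writing JMX code, per-column-family statistics also
surface in nodetool cfstats (the SSTable count and space-used figures quoted
later in this digest come from it); whether your build prints a key estimate
there is worth checking:

 # nodetool -h localhost cfstats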

Sheng

2011/3/28 Sheng Chen chensheng2...@gmail.com

 Hi all,
 I want to know how many records I am holding in Cassandra, just like
 count(*) in SQL.
 What can I do? Thank you.

 Sheng





Endless minor compactions after heavy inserts

2011-03-31 Thread Sheng Chen
I've got a single node of cassandra 0.7.4, and I used the java stress tool
to insert about 100 million records.
The inserts took about 6 hours (45k inserts/sec), but the minor compactions
that followed have lasted for 2 days, and the pending compaction jobs are
still increasing.

From jconsole I can read the MemtableThroughputInMB=1499,
MemtableOperationsInMillions=7.0
But in my data directory, I got hundreds of 438MB data files, which should
be the cause of the minor compactions.

I tried to set the compaction threshold via nodetool, but it didn't seem to
take effect (no change in pending compaction tasks).
After restarting the node, my setting is lost.

I want to distribute the read load across my disks (10 disks in xfs, LVM), so
I don't want to do a major compaction.
So, what can I do to keep the sstable files at a reasonable size, or to make
the minor compactions faster?

Thank you in advance.
Sheng


Compaction doubles disk space

2011-03-29 Thread Sheng Chen
I use the 'nodetool compact' command to start a compaction.
I can understand that extra disk space is required during the compaction,
but after the compaction, the extra space is not released.

Before compaction:
SSTable count: 10
space used (live): 19G
space used (total): 21G

After compaction:
sstable count: 1
space used (live): 19G
space used (total): 42G


BTW, given that compaction requires double the disk space, does it mean that
I should never exceed half of my total disk capacity?
e.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data at
all.
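Roughly: a major compaction rewrites the live set, so the old and new files
coexist until it finishes; 505GB live on a 1TB disk leaves no room for the
~505GB copy, so the tombstone-purging compaction that a delete ultimately
needs may never be able to run.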


Re: Compaction doubles disk space

2011-03-29 Thread Sheng Chen
From a previous thread on the same topic: I used a forced GC and the extra
space was released.

What about my second question?




2011/3/29 Sheng Chen chensheng2...@gmail.com

 I use the 'nodetool compact' command to start a compaction.
 I can understand that extra disk space is required during the compaction,
 but after the compaction, the extra space is not released.

 Before compaction:
 SSTable count: 10
 space used (live): 19G
 space used (total): 21G

 After compaction:
 sstable count: 1
 space used (live): 19G
 space used (total): 42G


 BTW, given that compaction requires double the disk space, does it mean
 that I should never exceed half of my total disk capacity?
 e.g. if I have 505GB of data on a 1TB disk, I cannot even delete any data
 at all.


Re: Compaction doubles disk space

2011-03-29 Thread Sheng Chen
Yes.
I think at least we could remove the tombstones from each sstable first, and
then do the merge.

2011/3/29 Karl Hiramoto k...@hiramoto.org

 Would it be possible to improve the current compaction disk space issue by
 compacting only a few SSTables at a time and then immediately deleting the
 old ones? Looking at the logs, it seems like deletions of old SSTables are
 taking longer than necessary.

 --
 Karl



Re: stress.py bug?

2011-03-22 Thread Sheng Chen
I am just wondering why the stress test tools (Python, Java) need more
threads.
Is the bottleneck of a single thread in the client, or in the server?
Thanks.

Sean

2011/3/22 Ryan King r...@twitter.com

 On Mon, Mar 21, 2011 at 4:02 AM, pob peterob...@gmail.com wrote:
  Hi,
  I'm inserting data from a client node with stress.py into a cluster of 6
  nodes. They are all on a 1Gbps network; the max real network throughput is
  930Mbps (measured).
  python stress.py -c 1 -S 17  -d{6nodes}  -l3 -e QUORUM
   --operation=insert -i 1 -n 50 -t100
  The problem is that stress.py reports an average of ~750 ops/sec, which
  would be 127MB/s, but the real network throughput is ~116MB/s.

 You may need more concurrency in order to saturate your network.
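 Roughly, with synchronous requests, throughput ≈ concurrency / per-request
 latency (Little's law); a single thread at ~1ms per call tops out near
 1,000 ops/sec no matter how fast the network is.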

 -ryan