commitlog replay missing data

2011-07-11 Thread Jeffrey Wang
Hey all,

 

Recently upgraded to 0.8.1 and noticed what seems to be missing data after a
commitlog replay on a single-node cluster. I start the node, insert a bunch
of stuff (~600MB), stop it, and restart it. There are log messages
pertaining to the commitlog replay and no errors, but some of the data is
missing. If I flush before stopping the node, everything is fine, and
running cfstats in the two cases shows different amounts of data in the
SSTables. Moreover, the amount of data that is missing is nondeterministic.
Has anyone run into this? Thanks.
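
(For reference, forcing the flush before the stop can be done with nodetool - the keyspace name below is a placeholder, Blocks is the CF from the cfstats output further down, and nodetool drain should likewise flush every memtable and stop the node accepting writes before shutdown:)

nodetool -h localhost flush MyKeyspace Blocks
nodetool -h localhost drain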

 

Here is the output of a side-by-side diff between cfstats outputs for a
single CF before restarting (left) and after (right). Somehow a 37MB
memtable became a 2.9MB SSTable (note the difference in write count as
well)?

 

Column Family: Blocks                    Column Family: Blocks
SSTable count: 0                       | SSTable count: 1
Space used (live): 0                   | Space used (live): 2907637
Space used (total): 0                  | Space used (total): 2907637
Memtable Columns Count: 8198           | Memtable Columns Count: 0
Memtable Data Size: 37550510           | Memtable Data Size: 0
Memtable Switch Count: 0               | Memtable Switch Count: 1
Read Count: 0                            Read Count: 0
Read Latency: NaN ms.                    Read Latency: NaN ms.
Write Count: 8198                      | Write Count: 1526
Write Latency: 0.018 ms.               | Write Latency: 0.011 ms.
Pending Tasks: 0                         Pending Tasks: 0
Key cache capacity: 20                   Key cache capacity: 20
Key cache size: 0                        Key cache size: 0
Key cache hit rate: NaN                  Key cache hit rate: NaN
Row cache: disabled                      Row cache: disabled
Compacted row minimum size: 0          | Compacted row minimum size: 1110
Compacted row maximum size: 0          | Compacted row maximum size: 2299
Compacted row mean size: 0             | Compacted row mean size: 1960

 

Note that I patched https://issues.apache.org/jira/browse/CASSANDRA-2317 in
my version, but there are no deletions involved so I don't think it's
relevant unless I messed something up while patching.

 

-Jeffrey





hinted handoff sleeping

2011-06-23 Thread Jeffrey Wang
Hey all,

 

We're running a slightly patched version of 0.7.3 on a cluster of 5 nodes.
I've been noticing a number of messages in our logs which look like this
(after a node goes down and comes back up, usually just due to a GC):

 

2011-06-23 14:46:35,381 INFO [HintedHandoff:1] org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:290) {USER='',IP=''} - Sleeping 32649ms to stagger hint delivery

 

The interesting thing is that we have hinted_handoff_enabled = false in the
YAML configuration, so it always says 0 rows are handed off (later, after
the sleep). Thus this sleeping seems quite wasteful. Is this part of the
code supposed to be reached even with hinted handoff disabled? Thanks.

 

-Jeffrey

 





RE: hinted handoff sleeping

2011-06-23 Thread Jeffrey Wang
No, it's always been off. No hints are ever being delivered, but the 
HintedHandOffManager still wakes up and sleeps to stagger hint delivery when nodes come back online.

-Jeffrey

-Original Message-
From: Ryan King [mailto:r...@twitter.com] 
Sent: Thursday, June 23, 2011 3:00 PM
To: user@cassandra.apache.org
Subject: Re: hinted handoff sleeping

On Thu, Jun 23, 2011 at 2:55 PM, Jeffrey Wang jw...@palantir.com wrote:
 Hey all,



 We’re running a slightly patched version of 0.7.3 on a cluster of 5 nodes.
 I’ve been noticing a number of messages in our logs which look like this
 (after a node goes “down” and comes back up, usually just due to a GC):



 2011-06-23 14:46:35,381 INFO [HintedHandoff:1]
 org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:290)
 {USER='',IP=''} - Sleeping 32649ms to stagger hint delivery



 The interesting thing is that we have hinted_handoff_enabled = false in the
 YAML configuration, so it always says 0 rows are handed off (later, after
 the sleep). Thus this sleeping seems quite wasteful. Is this part of the
 code supposed to be reach even with hinted handoff disabled? Thanks.

Did you previously run with HH on? That config setting prohibits new
hints from being created, but doesn't prevent existing ones from being
delivered.

-ryan




multiple clusters communicating

2011-06-06 Thread Jeffrey Wang
Hey all,

 

We're seeing a strange issue: two completely separate clusters (0.7.3) on the
same subnet (X.X.X.146 through X.X.X.150), one with 3 machines (146-148) and
one with 2 machines (149-150), end up gossiping with each other even though
each is seeded only with the machines in its own cluster. They have different
cluster names so they don't merge, but this is quite annoying as schema
changes don't actually go through. Anyone have any ideas about this? Thanks.
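
(For context, each cluster's 0.7-era cassandra.yaml lists only its own machines as seeds, roughly like the excerpt below - the masked X.X.X addresses stand in for the real ones; the 149-150 cluster lists only its own two machines:)

seeds:
    - X.X.X.146
    - X.X.X.147
    - X.X.X.148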

 

-Jeffrey

 





RE: pig + hadoop

2011-04-19 Thread Jeffrey Wang
Did you set PIG_RPC_PORT in your hadoop-env.sh? I was seeing this error for a 
while before I added that.

-Jeffrey
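
(For anyone hitting the same NumberFormatException: the lines I mean go in hadoop-env.sh on every TaskTracker node. The values below are examples - 9160 is the default Thrift rpc_port, and the partitioner has to match your cluster's:)

export PIG_RPC_PORT=9160
export PIG_INITIAL_ADDRESS=<address of one Cassandra node>
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner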

From: pob [mailto:peterob...@gmail.com]
Sent: Tuesday, April 19, 2011 6:42 PM
To: user@cassandra.apache.org
Subject: Re: pig + hadoop

Hey Aaron,

I read it, and all 3 env variables were exported. The results are the same.

Best,
P
2011/4/20 aaron morton aa...@thelastpickle.com
Am guessing but here goes. Looks like the cassandra RPC port is not set, did 
you follow these steps in contrib/pig/README.txt

Finally, set the following as environment variables (uppercase,
underscored), or as Hadoop configuration variables (lowercase, dotted):
* PIG_RPC_PORT or cassandra.thrift.port : the port thrift is listening on
* PIG_INITIAL_ADDRESS or cassandra.thrift.address : initial address to connect 
to
* PIG_PARTITIONER or cassandra.partitioner.class : cluster partitioner

Hope that helps.
Aaron


On 20 Apr 2011, at 11:28, pob wrote:


Hello,

I did the cluster configuration following http://wiki.apache.org/cassandra/HadoopSupport.
When I run pig example-script.pig -x local, everything is fine and I get correct results.

The problem occurs with -x mapreduce.

I'm getting these errors:


2011-04-20 01:24:21,791 [main] ERROR org.apache.pig.tools.pigstats.PigStats - 
ERROR: java.lang.NumberFormatException: null
2011-04-20 01:24:21,792 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil 
- 1 map reduce job(s) failed!
2011-04-20 01:24:21,793 [main] INFO  org.apache.pig.tools.pigstats.PigStats - 
Script Statistics:

Input(s):
Failed to read data from cassandra://Keyspace1/Standard1

Output(s):
Failed to produce result in hdfs://ip:54310/tmp/temp-1383865669/tmp-1895601791

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201104200056_0005   -  null,
null-  null,
null


2011-04-20 01:24:21,793 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher 
- Failed!
2011-04-20 01:24:21,803 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1066: Unable to open iterator for alias topnames. Backend error : 
java.lang.NumberFormatException: null




That's from the job tasks web management page - the error from the task directly:

java.lang.RuntimeException: java.lang.NumberFormatException: null
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:123)
	at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.initialize(PigRecordReader.java:176)
	at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:417)
	at java.lang.Integer.parseInt(Integer.java:499)
	at org.apache.cassandra.hadoop.ConfigHelper.getRpcPort(ConfigHelper.java:233)
	at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:105)
	... 5 more



Any suggestions where the problem might be?

Thanks,





DatabaseDescriptor.defsVersion

2011-04-15 Thread Jeffrey Wang
Hey all,

I've been seeing a very rare issue with schema change conflicts on 0.7.3 (I am 
serializing all schema changes to a single Cassandra node and waiting for them 
to finish before continuing). Occasionally a node in the cluster will never 
report the correct schema, and I think it may have to do with synchronization 
on DatabaseDescriptor.defsVersion.

As far as I can tell, it is a static variable accessed by multiple threads but 
is not protected by synchronized/volatile. I was able to write a test in which 
one thread never reads the modification done by another thread (as is expected 
by an unsynchronized variable). Should this be fixed or is there a higher level 
reason this does not need to be synchronized (in which case I should continue 
looking for the reason why my schemas don't agree)? Thanks.

-Jeffrey
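
(For reference, the test I mention is essentially the standard Java memory-model visibility demo, nothing Cassandra-specific - a minimal standalone sketch along these lines; marking the field volatile makes the reader reliably see the update:)

import java.util.UUID;

public class DefsVersionVisibilityDemo
{
    // Analogous to DatabaseDescriptor.defsVersion: a static field written by one
    // thread and read by another, with no volatile/synchronized protection.
    private static UUID version = null;

    public static void main(String[] args) throws InterruptedException
    {
        Thread reader = new Thread(new Runnable()
        {
            public void run()
            {
                // With no happens-before edge, the JIT may hoist this read out of
                // the loop, so the thread can spin forever after the write below.
                while (version == null)
                    ;
                System.out.println("reader saw " + version);
            }
        });
        reader.setDaemon(true);
        reader.start();
        Thread.sleep(100);
        version = UUID.randomUUID();  // written by the main thread
        reader.join(5000);
        System.out.println(reader.isAlive() ? "reader never observed the update" : "reader observed the update");
    }
}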



RE: DatabaseDescriptor.defsVersion

2011-04-15 Thread Jeffrey Wang
Done: https://issues.apache.org/jira/browse/CASSANDRA-2490

-Jeffrey

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Friday, April 15, 2011 7:39 PM
To: user@cassandra.apache.org
Cc: Jeffrey Wang
Subject: Re: DatabaseDescriptor.defsVersion

I think you found a bug; it should be volatile.  (Cassandra does
already make sure that only one change runs internally at a time.)

Can you create a ticket?

On Fri, Apr 15, 2011 at 6:04 PM, Jeffrey Wang jw...@palantir.com wrote:
 Hey all,



 I've been seeing a very rare issue with schema change conflicts on 0.7.3 (I
 am serializing all schema changes to a single Cassandra node and waiting for
 them to finish before continuing). Occasionally a node in the cluster will
 never report the correct schema, and I think it may have to do with
 synchronization on DatabaseDescriptor.defsVersion.



 As far as I can tell, it is a static variable accessed by multiple threads
 but is not protected by synchronized/volatile. I was able to write a test in
 which one thread never reads the modification done by another thread (as is
 expected by an unsynchronized variable). Should this be fixed or is there a
 higher level reason this does not need to be synchronized (in which case I
 should continue looking for the reason why my schemas don't agree)? Thanks.



 -Jeffrey





-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


RE: pig counting question

2011-03-25 Thread Jeffrey Wang
I don't think it's Pig running out of memory, but rather Cassandra itself (the 
data doesn't even make it to Pig). get_range_slices() is called with a row 
batch size of 4096, the default, and it's fetching all of the columns in each 
row. If I have 10K columns in each row, that's a huge request, and Cassandra 
runs into memory pressure trying to serve it.

That's my understanding of it; if there's something I'm missing, please let me 
know.

-Jeffrey

-Original Message-
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
Sent: Friday, March 25, 2011 11:06 AM
To: user@cassandra.apache.org
Subject: Re: pig counting question

One thing I wonder though - if your columns are the thing that are increasing 
your heap size and eating up a lot of memory, and you're reading the data 
structure out as a bag of columns, why isn't pig spilling to disk instead of 
growing in memory.  The pig model is that you can have huge bags that don't 
kill you on memory but they are just slower because they spill to disk.  What 
is the schema that you impose when you load the data?

On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:

 It looks like this functionality is not in the 0.7.3 version of 
 CassandraStorage. I tried to add the constructor which takes the limit to the 
 class, but I ran into some Pig parsing errors, so I had to make the parameter 
 a string. How did you get around this for the version of CassandraStorage in 
 trunk? I'm running Pig 0.8.0.
 
 Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra 
 starts eating up huge amounts of memory, maxing out my 16GB heap size. I 
 suspect this is because of the get_range_slices() call from 
 ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
 
 -Jeffrey
 
 -Original Message-
 From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
 Sent: Thursday, March 24, 2011 11:34 AM
 To: user@cassandra.apache.org
 Subject: Re: pig counting question
 
 The limit defaults to 1024 but you can set it when you use CassandraStorage 
 in pig, like so:
 rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
 or whatever value you wish.
 
 Give that a try and see if it gives you more of what you're looking for.
 
 On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
 
 Hey all,
 
 I'm trying to run a very simple Pig script against my Cassandra cluster (5 
 nodes, 0.7.3). I've gotten it all set up and working, but the script is 
 giving me some strange results. Here is my script:
 
 rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
 rowct = FOREACH rows GENERATE $0, COUNT($1);
 dump rowct;
 
 If I understand Pig correctly, this should output (row name, column count) 
 tuples, but I'm always seeing 1024 for the column count even though the rows 
 have highly variable number of columns. Am I missing something? Thanks.
 
 -Jeffrey
 
 



RE: pig counting question

2011-03-25 Thread Jeffrey Wang
Just to be clear, it's also the case that if I have a Hadoop TaskTracker 
running on each node that Cassandra is running on, a map/reduce job will 
automatically handle data locality, right? I.e. each mapper will only read 
splits which live on the same box.

-Jeffrey

-Original Message-
From: Jeffrey Wang [mailto:jw...@palantir.com] 
Sent: Friday, March 25, 2011 11:42 AM
To: user@cassandra.apache.org
Subject: RE: pig counting question

I don't think it's Pig running out of memory, but rather Cassandra itself (the 
data doesn't even make it to Pig). get_range_slices() is called with a row 
batch size of 4096, the default, and it's fetching all of the columns in each 
row. If I have 10K columns in each row, that's a huge request, and Cassandra 
runs into memory pressure trying to serve it.

That's my understanding of it; if there's something I'm missing, please let me 
know.

-Jeffrey

-Original Message-
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
Sent: Friday, March 25, 2011 11:06 AM
To: user@cassandra.apache.org
Subject: Re: pig counting question

One thing I wonder though - if your columns are the thing that are increasing 
your heap size and eating up a lot of memory, and you're reading the data 
structure out as a bag of columns, why isn't pig spilling to disk instead of 
growing in memory.  The pig model is that you can have huge bags that don't 
kill you on memory but they are just slower because they spill to disk.  What 
is the schema that you impose when you load the data?

On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:

 It looks like this functionality is not in the 0.7.3 version of 
 CassandraStorage. I tried to add the constructor which takes the limit to the 
 class, but I ran into some Pig parsing errors, so I had to make the parameter 
 a string. How did you get around this for the version of CassandraStorage in 
 trunk? I'm running Pig 0.8.0.
 
 Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra 
 starts eating up huge amounts of memory, maxing out my 16GB heap size. I 
 suspect this is because of the get_range_slices() call from 
 ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
 
 -Jeffrey
 
 -Original Message-
 From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
 Sent: Thursday, March 24, 2011 11:34 AM
 To: user@cassandra.apache.org
 Subject: Re: pig counting question
 
 The limit defaults to 1024 but you can set it when you use CassandraStorage 
 in pig, like so:
 rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
 or whatever value you wish.
 
 Give that a try and see if it gives you more of what you're looking for.
 
 On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
 
 Hey all,
 
 I'm trying to run a very simple Pig script against my Cassandra cluster (5 
 nodes, 0.7.3). I've gotten it all set up and working, but the script is 
 giving me some strange results. Here is my script:
 
 rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
 rowct = FOREACH rows GENERATE $0, COUNT($1);
 dump rowct;
 
 If I understand Pig correctly, this should output (row name, column count) 
 tuples, but I'm always seeing 1024 for the column count even though the rows 
 have highly variable number of columns. Am I missing something? Thanks.
 
 -Jeffrey
 
 



RE: pig counting question

2011-03-24 Thread Jeffrey Wang
It looks like this functionality is not in the 0.7.3 version of 
CassandraStorage. I tried to add the constructor which takes the limit to the 
class, but I ran into some Pig parsing errors, so I had to make the parameter a 
string. How did you get around this for the version of CassandraStorage in 
trunk? I'm running Pig 0.8.0.

Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra starts 
eating up huge amounts of memory, maxing out my 16GB heap size. I suspect this 
is because of the get_range_slices() call from ColumnFamilyRecordReader. Are 
there plans to make this streaming/paged?

-Jeffrey

-Original Message-
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
Sent: Thursday, March 24, 2011 11:34 AM
To: user@cassandra.apache.org
Subject: Re: pig counting question

The limit defaults to 1024 but you can set it when you use CassandraStorage in 
pig, like so:
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
or whatever value you wish.

Give that a try and see if it gives you more of what you're looking for.

On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:

 Hey all,
  
 I'm trying to run a very simple Pig script against my Cassandra cluster (5 
 nodes, 0.7.3). I've gotten it all set up and working, but the script is 
 giving me some strange results. Here is my script:
  
 rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
 rowct = FOREACH rows GENERATE $0, COUNT($1);
 dump rowct;
  
 If I understand Pig correctly, this should output (row name, column count) 
 tuples, but I'm always seeing 1024 for the column count even though the rows 
 have highly variable number of columns. Am I missing something? Thanks.
  
 -Jeffrey
  



RE: running all unit tests

2011-03-15 Thread Jeffrey Wang
Awesome, thanks. I'm seeing some weird errors due to deleting commit logs, 
though (I'm running on Windows, which might have something to do with it):

[junit] java.io.IOException: Failed to delete C:\Documents and Settings\jwang\workspace-cass\Cassandra\Cassandra-0.7.0\build\test\cassandra\commitlog\CommitLog-1300214497376.log
[junit]   at org.apache.cassandra.io.util.FileUtils.deleteWithConfirm(FileUtils.java:54)
[junit]   at org.apache.cassandra.io.util.FileUtils.deleteRecursive(FileUtils.java:201)
[junit]   at org.apache.cassandra.io.util.FileUtils.deleteRecursive(FileUtils.java:197)
[junit]   at org.apache.cassandra.CleanupHelper.cleanup(CleanupHelper.java:55)
[junit]   at org.apache.cassandra.CleanupHelper.cleanupAndLeaveDirs(CleanupHelper.java:41)

Does anyone know how to get these to work?

-Jeffrey
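
(For the archives: the target Aaron mentions below is just "ant test" from the source root. If memory serves, a single test class can be run with something like the second line, though the property name may differ between versions:)

ant test
ant test -Dtest.name=CommitLogTest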

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Tuesday, March 15, 2011 1:26 AM
To: user@cassandra.apache.org
Subject: Re: running all unit tests

There is a test target in the build script.

Aron

On 15 Mar 2011, at 17:29, Jeffrey Wang wrote:


Hey all,

We're applying some patches to our own branch of Cassandra, and we are 
wondering if there is a good way to run all the unit tests. Just having JUnit 
run all the test classes seems to result in a lot of errors that are hard to 
fix, so I'm hoping there's an easy way to do this. Thanks!

-Jeffrey




running all unit tests

2011-03-14 Thread Jeffrey Wang
Hey all,

We're applying some patches to our own branch of Cassandra, and we are 
wondering if there is a good way to run all the unit tests. Just having JUnit 
run all the test classes seems to result in a lot of errors that are hard to 
fix, so I'm hoping there's an easy way to do this. Thanks!

-Jeffrey



get_range_slices perf

2011-03-13 Thread Jeffrey Wang
Hey all,

I'm trying to get a list of all the rows from a column family using 
get_range_slices retrieving no actual columns. I expected this operation to be 
pretty quick, but it seems to take a while (5-node 0.7.0 cluster takes 20 min 
to page through 60k keys 1000 at a time). It's not completely clear to me from 
the code, but is there a lot of SSTable reading involved when getting just the 
row names? And is this the best way to read all of the row names in a CF? 
Thanks.

-Jeffrey
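
(In case it's useful, the paging I'm doing looks roughly like this - a trimmed-down sketch against the 0.7 Thrift API; the host, keyspace, and CF names are placeholders, and error handling is elided:)

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.KeyRange;
import org.apache.cassandra.thrift.KeySlice;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ListRowKeys
{
    public static void main(String[] args) throws Exception
    {
        // 0.7 defaults to the framed transport on port 9160.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("MyKeyspace");                                        // placeholder

        // Ask for zero columns per row: we only want the row keys back.
        SlicePredicate noColumns = new SlicePredicate();
        noColumns.setSlice_range(new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 0));

        KeyRange range = new KeyRange();
        range.setCount(1000);                                                     // page size
        range.setStart_key(ByteBuffer.allocate(0));
        range.setEnd_key(ByteBuffer.allocate(0));

        int seen = 0;
        while (true)
        {
            List<KeySlice> page = client.get_range_slices(new ColumnParent("MyCF"),  // placeholder
                                                          noColumns, range, ConsistencyLevel.ONE);
            if (page.isEmpty())
                break;
            seen += page.size();
            if (page.size() < 1000)
                break;                                                            // last (partial) page
            // The next page starts at the last key we saw; that row comes back again,
            // so real code should drop the duplicate when collecting keys.
            range.setStart_key(page.get(page.size() - 1).key);
        }
        System.out.println("paged over roughly " + seen + " row keys (page-boundary duplicates included)");
        transport.close();
    }
}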



understanding tombstones

2011-03-09 Thread Jeffrey Wang
Hey all,

I was wondering if this is the expected behavior of deletes (0.7.0). Let's say 
I have a 1-node cluster with a single CF which has gc_grace_seconds = 0. The 
following sequence of operations happens (in the given order):

insert row X with timestamp T
delete row X with timestamp T+1
force flush + compaction
insert row X with timestamp T

My understanding is that the tombstone created by the delete (and row X) will 
disappear with the flush + compaction which means the last insertion should 
show up. My experimentation, however, suggests otherwise (the last insertion 
does not show up).

I believe I have traced this to the fact that the markedForDeleteAt field on 
the ColumnFamily does not get reset after a compaction (after gc_grace_seconds 
has passed); is this desirable? I think it introduces an inconsistency in how 
tombstoned columns work versus tombstoned CFs. Thanks.

-Jeffrey
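
(For clarity, the sequence above looks roughly like this over the 0.7 Thrift API - a sketch only; the CF/key/column names are placeholders, `client` is a connected Cassandra.Client, and the flush + compaction step is done out-of-band with nodetool:)

ByteBuffer key = ByteBufferUtil.bytes("rowX");                 // placeholder row key
ColumnParent cp = new ColumnParent("Standard1");               // placeholder CF with gc_grace_seconds = 0
long t = System.currentTimeMillis() * 1000;                    // "timestamp T"

// insert row X with timestamp T
client.insert(key, cp, new Column(ByteBufferUtil.bytes("c"), ByteBufferUtil.bytes("v"), t), ConsistencyLevel.ONE);
// delete row X with timestamp T+1 (a ColumnPath with only the CF set = row-level delete)
client.remove(key, new ColumnPath("Standard1"), t + 1, ConsistencyLevel.ONE);
// force flush + compaction, e.g.:  nodetool flush <keyspace> Standard1 ; nodetool compact <keyspace> Standard1
// insert row X with timestamp T again - this is the write I expected to show up
client.insert(key, cp, new Column(ByteBufferUtil.bytes("c"), ByteBufferUtil.bytes("v"), t), ConsistencyLevel.ONE);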



RE: understanding tombstones

2011-03-09 Thread Jeffrey Wang
Yup. https://issues.apache.org/jira/browse/CASSANDRA-2305

-Jeffrey

-Original Message-
From: Jonathan Ellis [mailto:jbel...@gmail.com] 
Sent: Wednesday, March 09, 2011 6:19 PM
To: user@cassandra.apache.org
Subject: Re: understanding tombstones

On Wed, Mar 9, 2011 at 4:54 PM, Jeffrey Wang jw...@palantir.com wrote:
 insert row X with timestamp T
 delete row X with timestamp T+1
 force flush + compaction
 insert row X with timestamp T

 My understanding is that the tombstone created by the delete (and row X)
 will disappear with the flush + compaction which means the last insertion
 should show up.

Right.

 I believe I have traced this to the fact that the markedForDeleteAt field on
 the ColumnFamily does not get reset after a compaction (after
 gc_grace_seconds has passed); is this desirable? I think it introduces an
 inconsistency in how tombstoned columns work versus tombstoned CFs. Thanks.

That does sound like a bug.  Can you create a ticket?

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


when do snapshots go away?

2011-03-07 Thread Jeffrey Wang
Hi all,

When I drop a column family, it creates a snapshot. When does the snapshot go 
away and free up the disk space? I was able to run nodetool clearsnapshot to 
get rid of them, but will they go away themselves? (Also, is there a purpose to 
keeping a snapshot around?)

-Jeffrey



RE: memtable_flush_after_mins setting not working

2011-02-25 Thread Jeffrey Wang
I just noticed this thread. Does this mean that (assuming the same setup of an 
empty keyspace and CFs added later) if I have a CF that I write to for some 
time, but not enough to hit the flush limits, it will never get flushed until 
the server is restarted? I believe this is causing commit logs to not be 
deleted, which is taking up a ton of disk space (in addition to a bunch of 
small memtables being stuck in memory).

-Jeffrey

From: Ching-Cheng Chen [mailto:cc...@evidentsoftware.com]
Sent: Thursday, February 17, 2011 8:52 AM
To: user@cassandra.apache.org
Cc: Jonathan Ellis
Subject: Re: memtable_flush_after_mins setting not working

https://issues.apache.org/jira/browse/CASSANDRA-2183

Regards,

Chen

www.evidentsoftware.com
On Thu, Feb 17, 2011 at 11:47 AM, Ching-Cheng Chen cc...@evidentsoftware.com wrote:
Certainly, I'll open a ticket to track this issue.

Regards,

Chen

www.evidentsoftware.com

On Thu, Feb 17, 2011 at 11:42 AM, Jonathan Ellis jbel...@gmail.com wrote:
Your analysis sounds correct to me.  Can you open a ticket on
https://issues.apache.org/jira/browse/CASSANDRA ?

On Thu, Feb 17, 2011 at 10:17 AM, Ching-Cheng Chen cc...@evidentsoftware.com wrote:
 We have observed that the memtable_flush_after_mins setting occasionally does not
 work. After some testing and code digging, we finally figured out what's going on.
 The memtable_flush_after_mins setting won't work under certain conditions with the
 current implementation in Cassandra.

 In org.apache.cassandra.db.Table, the scheduled flush task is set up by the
 following code during construction:

 int minCheckMs = Integer.MAX_VALUE;
 for (ColumnFamilyStore cfs : columnFamilyStores.values())
 {
     minCheckMs = Math.min(minCheckMs, cfs.getMemtableFlushAfterMins() * 60 * 1000);
 }
 Runnable runnable = new Runnable()
 {
     public void run()
     {
         for (ColumnFamilyStore cfs : columnFamilyStores.values())
         {
             cfs.forceFlushIfExpired();
         }
     }
 };
 flushTask = StorageService.scheduledTasks.scheduleWithFixedDelay(runnable, minCheckMs, minCheckMs, TimeUnit.MILLISECONDS);

 Now, our application creates a keyspace without any column families first, and
 only adds the needed column families later, on request. However, when the
 keyspace is created (without any column families), the above code schedules the
 fixed-delay flush check task with an interval of Integer.MAX_VALUE ms, since
 there are no column families yet.
 Later, when you add a column family to this empty keyspace, the initCf() method
 in Table.java doesn't check whether the scheduled flush check task's interval
 needs to be updated. As it stands, we'd need to restart Cassandra after a
 column family is added to the keyspace.
 I would suggest adding logic to the initCf() method to recreate the scheduled
 flush check task if needed (a rough sketch follows at the end of this thread).
 Regards,
 Chen
 www.evidentsoftware.com


--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


--
www.evidentsoftware.com
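
(For illustration, Chen's suggested initCf() change could look roughly like the sketch below. This is only a sketch, not the actual fix from CASSANDRA-2183: it assumes the flush-check Runnable and its current interval are kept as fields on Table, which in 0.7.3 they are not.)

// Hypothetical helper, to be called from Table.initCf() after the new
// ColumnFamilyStore has been registered. Field names are assumptions:
// flushCheck is the Runnable from the constructor code quoted above,
// flushTask is the ScheduledFuture it returned, and scheduledCheckMs
// tracks the interval the task is currently running at.
private void maybeRescheduleFlushCheck(ColumnFamilyStore newCfs)
{
    int cfCheckMs = newCfs.getMemtableFlushAfterMins() * 60 * 1000;
    if (cfCheckMs < scheduledCheckMs)
    {
        if (flushTask != null)
            flushTask.cancel(false);
        flushTask = StorageService.scheduledTasks.scheduleWithFixedDelay(flushCheck, cfCheckMs, cfCheckMs, TimeUnit.MILLISECONDS);
        scheduledCheckMs = cfCheckMs;
    }
}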



dropped mutations, UnavailableException, and long GC

2011-02-24 Thread Jeffrey Wang
Hey all,

Our setup is 5 machines running Cassandra 0.7.0 with 24GB of heap and 1.5TB 
disk each collocated in a DC. We're doing bulk imports from each of the nodes 
with RF = 2 and write consistency ANY (write perf is very important). The 
behavior we're seeing is this:


-  Nodes often see each other as dead even though none of the nodes 
actually go down. I suspect this may be due to long GCs. It seems like 
increasing the RPC timeout could help this, but I'm not convinced this is the 
root of the problem. Note that in this case writes return with the 
UnavailableException.

-  As mentioned, long GCs. We see the ParNew GC doing a lot of smaller 
collections (few hundred MB) which are very fast (few hundred ms), but every 
once in a while the ConcurrentMarkSweep will take a LONG time (up to 15 min!) 
to collect upwards of 15GB at once.

-  On some nodes, we see a lot of pending MutationStages build up (e.g. 
500K), which leads to messages like "Dropped X MUTATION messages in the last 
5000ms", presumably meaning that Cassandra has decided to not write one of the 
replicas of the data. This is not a HUGE deal, but is less than ideal.

-  The end result is that a bunch of writes end up failing due to the 
UnavailableExceptions, so not all of our data is getting into Cassandra.

So my question is: what is the best way to avoid this behavior? Our memtable 
thresholds are fairly low (256MB) so there should be plenty of heap space to 
work with. We may experiment with write consistency ONE or ALL to see if the 
perf hit is not too bad, but I wanted to get some opinions on why this might be 
happening. Thanks!

-Jeffrey



RE: rolling window of data

2011-02-03 Thread Jeffrey Wang
Thanks for the response, but unfortunately a TTL is not enough for us. We would 
like to be able to dynamically control the window in case there is an unusually 
large amount of data or something so we don't run out of disk space.

One question I have in particular is: if I use the timestamp of my log entries 
(not necessarily correlated at all with the timestamp of insert) as the 
timestamp on my mutations will Cassandra do the right thing when I delete? We 
don't have any need for conflict resolution, so we are currently just using the 
current time.

It seems like there is a possibility, depending on the implementation details 
of Cassandra, that I could call a remove with a timestamp for which everything 
before that should get deleted. Like I said before, this seems a bit hacky to 
me, but would it get the job done?

-Jeffrey

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Thursday, February 03, 2011 8:48 AM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

 The correct way to accomplish what you describe is the new (in 0.7)
 per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 day's worth of
 seconds) and your columns will magically disappear after that length of
 time.

Although that assumes it's okay to lose data or that there is some
other method in place to prevent loss of it should the data not be
processed to whatever extent is required.

TTL:s would be a great way to efficiently achieve the windowing, but
it does remove the ability to explicitly control exactly when data is
removed (such as after certain batch processing of it has completed).

-- 
/ Peter Schuller
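
(For reference, the per-column TTL Peter describes looks roughly like this over the 0.7 Thrift API - a fragment only, assuming `client` is a connected Cassandra.Client, `rowKey` is the row's ByteBuffer key, and "Logs"/"logline" are placeholder names:)

Column col = new Column(ByteBufferUtil.bytes("logline"), ByteBufferUtil.bytes("..."), System.currentTimeMillis() * 1000);
col.setTtl(60 * 60 * 24 * 90);   // 90 days' worth of seconds; the column expires after that
client.insert(rowKey, new ColumnParent("Logs"), col, ConsistencyLevel.QUORUM);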


RE: rolling window of data

2011-02-03 Thread Jeffrey Wang
To be a little more clear, a simplified version of what I'm asking is:

Let's say you add 1K columns with timestamps 1 to 1000. Then, at an arbitrarily 
distant point in the future, if you call remove on that CF with timestamp 500 
(so the timestamps are logically out of order), will it delete exactly half of 
it or is there stuff that might go on under the covers that makes this not work 
as you might expect?

-Jeffrey
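
(Concretely, the hack I'm describing is a row-level remove with an artificial timestamp over the 0.7 Thrift API - a sketch, assuming `client` is a connected Cassandra.Client and "Logs"/the row key are placeholders:)

// Everything in the row written with timestamp <= cutoff is shadowed by the
// resulting row tombstone; columns with newer (larger) timestamps survive.
long cutoff = 500;                             // e.g. the oldest log-entry timestamp to keep, minus 1
ColumnPath wholeRow = new ColumnPath("Logs");  // CF only - no column/super_column set means delete the whole row
client.remove(ByteBufferUtil.bytes("some-row-key"), wholeRow, cutoff, ConsistencyLevel.QUORUM);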

-Original Message-
From: Jeffrey Wang [mailto:jw...@palantir.com] 
Sent: Thursday, February 03, 2011 3:03 PM
To: user@cassandra.apache.org
Subject: RE: rolling window of data

Thanks for the response, but unfortunately a TTL is not enough for us. We would 
like to be able to dynamically control the window in case there is an unusually 
large amount of data or something so we don't run out of disk space.

One question I have in particular is: if I use the timestamp of my log entries 
(not necessarily correlated at all with the timestamp of insert) as the 
timestamp on my mutations will Cassandra do the right thing when I delete? We 
don't have any need for conflict resolution, so we are currently just using the 
current time.

It seems like there is a possibility, depending on the implementation details 
of Cassandra, that I could call a remove with a timestamp for which everything 
before that should get deleted. Like I said before, this seems a bit hacky to 
me, but would it get the job done?

-Jeffrey

-Original Message-
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Thursday, February 03, 2011 8:48 AM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

 The correct way to accomplish what you describe is the new (in 0.7)
 per-column TTL.  Simply set this to 60 * 60 * 24 * 90 (90 day's worth of
 seconds) and your columns will magically disappear after that length of
 time.

Although that assumes it's okay to lose data or that there is some
other method in place to prevent loss of it should the data not be
processed to whatever extent is required.

TTL:s would be a great way to efficiently achieve the windowing, but
it does remove the ability to explicitly control exactly when data is
removed (such as after certain batch processing of it has completed).

-- 
/ Peter Schuller


rolling window of data

2011-02-02 Thread Jeffrey Wang
Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. 
last 90 days). We use the timestamp of the log entries as the column names so 
we can do time range queries. Everything seems to be working fine, but it's not 
clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, 
but that apparently is not supported yet. Another idea I had was to store the 
timestamp of the log entry as Cassandra's timestamp and pass in artificial 
timestamps to remove (thrift API), but that seems hacky. Does anyone know if 
there is a good way to support this kind of rolling window of data efficiently? 
Thanks.

-Jeffrey



RE: rolling window of data

2011-02-02 Thread Jeffrey Wang
Thanks for the link, but unfortunately it doesn't look like it uses a rolling 
window. As far as I can tell, log entries just keep getting inserted into 
Cassandra.

-Jeffrey

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Wednesday, February 02, 2011 9:21 PM
To: user@cassandra.apache.org
Subject: Re: rolling window of data

This project may provide some inspiration for you 
https://github.com/thobbs/logsandra

Not sure if it has a rolling window, if you find out let me know :)

Aaron


On 03 Feb 2011, at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote:
Hi,

We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. 
last 90 days). We use the timestamp of the log entries as the column names so 
we can do time range queries. Everything seems to be working fine, but it's not 
clear if there is an efficient way to delete data that is more than 90 days old.

Originally I thought that using a slice range on a deletion would do the trick, 
but that apparently is not supported yet. Another idea I had was to store the 
timestamp of the log entry as Cassandra's timestamp and pass in artificial 
timestamps to remove (thrift API), but that seems hacky. Does anyone know if 
there is a good way to support this kind of rolling window of data efficiently? 
Thanks.

-Jeffrey