Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. It is also confusing that these writes show up in JMX looking like user-generated write operations.
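[Editor's note: there does not appear to be a dedicated switch for this in 1.1. Judging from the guard in the CollationController.java snippet quoted further down, the rewrite is implicitly disabled when minor compaction is disabled or when the column family is not using SizeTieredCompactionStrategy, and, per Mike's observation below, it also does not occur when row cachinging serves the read. A minimal standalone sketch of that condition; names and the main() scenario are illustrative, not actual Cassandra code:

    // Sketch of the hoisting guard from CollationController (1.1), with the
    // inputs that implicitly turn the feature off pulled out as parameters.
    public class HoistingGuard
    {
        static boolean shouldHoist(int sstablesIterated,
                                   int minCompactionThreshold,
                                   boolean compactionDisabled,
                                   boolean sizeTieredStrategy)
        {
            return sstablesIterated > minCompactionThreshold
                   && !compactionDisabled
                   && sizeTieredStrategy;
        }

        public static void main(String[] args)
        {
            // a by-name read that touched 5 sstables, default threshold of 4
            System.out.println(shouldHoist(5, 4, false, true)); // true: the select writes
            // the same read with minor compaction disabled
            System.out.println(shouldHoist(5, 4, true, true));  // false: no write
        }
    }
]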
On Mon, Dec 17, 2012 at 10:01 AM, Mike <mthero...@yahoo.com> wrote:

> Thank you Aaron, this was very helpful.
>
> Could it be an issue that this optimization does not really take effect
> until the memtable with the hoisted data is flushed? In my simple example
> below, the same row is updated and multiple selects of the same row will
> result in multiple writes to the memtable. It seems it may be possible
> (although unlikely) that, if you go from a write-mostly to a read-mostly
> scenario, you could get into a state where you are stuck rewriting to the
> same memtable, and the memtable is never flushed because it absorbs the
> over-writes. I can foresee this especially if you are reading the same
> rows repeatedly.
>
> I also noticed from the code paths that if row caching is enabled, this
> optimization will not occur. We made some changes this weekend to make
> this column family more suitable to row caching and enabled row caching
> with a small cache. Our initial results are that it seems to have corrected
> the write counts, and it has increased performance quite a bit. However,
> are there any hidden gotchas there because this optimization is not
> occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions
> a "compaction is behind" problem. Any history on that? I couldn't find
> much information on it.
>
> Thanks,
> -Mike
>
> On 12/16/2012 8:41 PM, aaron morton wrote:
>
> > 1) Am I reading things correctly?
>
> Yes.
> If you do a read/slice by name and more than min_compaction_threshold
> SSTables were read, the data is re-written so that the next read uses
> fewer SSTables.
>
> > 2) What is really happening here? Essentially minor compactions can
> > occur between 4 and 32 memtable flushes. Looking through the code, this
> > seems to only affect a couple of types of select statements (selecting
> > a specific column on a specific key being one of them). During the time
> > between these two values, every "select" statement will perform a write.
>
> Yup, only for reading a row where the column names are specified.
> Remember that minor compaction when using SizeTiered compaction (the
> default) works on buckets of SSTables of the same size.
>
> Imagine a row that had been around for a while and had fragments in more
> than min_compaction_threshold SSTables. Say it is in 3 SSTables in the
> 2nd tier and 2 SSTables in the 1st, so it takes (potentially) 5 SSTable
> reads. If this row is read it will get hoisted back up.
>
> But if the row is in only 1 SSTable in the 2nd tier and 2 in the 1st
> tier, it will not be hoisted.
>
> There are a few short circuits in the slice-by-name read path. One of
> them is to end the search when we know that no other SSTables contain
> columns that should be considered. So if the 4 columns you read
> frequently are hoisted into the 1st bucket, your reads will get handled
> by that one bucket.
>
> It's not every select, just those that touched more than
> min_compaction_threshold SSTables.
>
> > 3) Is this desired behavior? Is there something else I should be
> > looking at that could be causing this behavior?
>
> Yes.
> https://issues.apache.org/jira/browse/CASSANDRA-2503
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
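[Editor's note: a toy model (not Cassandra code; all names invented) of the dynamic Aaron and Mike describe above: a by-name read touching more than min_compaction_threshold SSTables re-writes the row to the memtable, keeps doing so on every repeated read until a flush, and afterwards short-circuits at the newly flushed SSTable:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    public class HoistingToyModel
    {
        // lowered threshold, as in the test environment described below
        static final int MIN_COMPACTION_THRESHOLD = 2;

        static boolean memtableHasRow = false;
        // newest-first SSTables holding fragments of one row
        static Deque<String> sstables = new ArrayDeque<>(
                List.of("sst-1", "sst-2", "sst-3", "sst-4"));

        static void readByName()
        {
            // once a flushed SSTable holds the complete hoisted row, the
            // short-circuit ends the search after a single SSTable
            int iterated = "sst-hoisted".equals(sstables.peekFirst())
                           ? 1 : sstables.size();
            System.out.println("read touched " + iterated + " sstable(s)");
            if (iterated > MIN_COMPACTION_THRESHOLD)
            {
                memtableHasRow = true; // the select performed a write
                System.out.println("  -> hoisted row re-written to memtable");
            }
        }

        static void flush()
        {
            if (memtableHasRow)
            {
                sstables.addFirst("sst-hoisted"); // becomes the newest sstable
                memtableHasRow = false;
            }
        }

        public static void main(String[] args)
        {
            readByName(); // touches 4 sstables -> write
            readByName(); // still 4 -> writes again, matching the repeated traces below
            flush();      // hoisted row lands in a new sstable
            readByName(); // short-circuits at the newest sstable, no write
        }
    }
]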
On 15/12/2012, at 12:58 PM, Michael Theroux <mthero...@yahoo.com> wrote:

> Hello,
>
> We have an unusual situation that I believe I've reproduced, at least
> temporarily, in a test environment. I also think I see where this issue
> is occurring in the code.
>
> We have a specific column family that is under heavy read and write load
> on a nightly basis. For the purposes of this description, I'll refer to
> this column family as "Bob". During this nightly processing, sometimes
> Bob is under very heavy write load, other times very heavy read load.
>
> The application is such that when something is written to Bob, a write
> is made to one of two other tables. We've witnessed a situation where
> the write count on Bob far outstrips the write count on either of the
> other tables, by a factor of 3 to 10. This is based on the WriteCount
> available on the column family JMX MBean. We have not been able to find
> where in our code this is happening, and we have gone as far as tracing
> our CQL calls to determine that the relationship between Bob and the
> other tables is what we expect.
>
> I brought up a test node to experiment, and I see a situation where,
> when a "select" statement is executed, a write will occur.
>
> In my test, I perform the following (switching between nodetool and
> cqlsh):
>
> update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>
> Then, for a period of time (before a minor compaction occurs), a select
> statement that selects specific columns will cause the write count of
> the column family to increment:
>
> select about,changed,data from bob where key='<hex key>';
>
> This situation will continue until a minor compaction is completed.
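[Editor's note: to watch the WriteCount counter Michael mentions move while repeating the select, something like the following should work against a test node. This is a sketch: it assumes the 1.x MBean naming (type=ColumnFamilies) and the default JMX port 7199; the keyspace name "open" and column family "bob" are taken from the traces below:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class WriteCountProbe
    {
        public static void main(String[] args) throws Exception
        {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url))
            {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                // per-column-family MBean; run before and after each select
                ObjectName cf = new ObjectName(
                        "org.apache.cassandra.db:type=ColumnFamilies,keyspace=open,columnfamily=bob");
                System.out.println("WriteCount = " + mbs.getAttribute(cf, "WriteCount"));
            }
        }
    }
]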
> I went into the code and added some traces to CollationController.java:
>
>     private ColumnFamily collectTimeOrderedData()
>     {
>         logger.debug("collectTimeOrderedData");
>
>         ... <snip> ...
>
> ---> HERE
>         logger.debug("tables iterated: " + sstablesIterated
>                      + " Min compact: " + cfs.getMinimumCompactionThreshold());
>
>         // "hoist up" the requested data into a more recent sstable
>         if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>             && !cfs.isCompactionDisabled()
>             && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>         {
>             RowMutation rm = new RowMutation(cfs.table.name,
>                                              new Row(filter.key, returnCF.cloneMe()));
>             try
>             {
> ---> HERE
>                 logger.debug("Apply hoisted up row mutation");
>
>                 // skipping commitlog and index updates is fine since
>                 // we're just de-fragmenting existing data
>                 Table.open(rm.getTable()).apply(rm, false, false);
>             }
>             catch (IOException e)
>             {
>                 // log and allow the result to be returned
>                 logger.error("Error re-writing read results", e);
>             }
>         }
>
>         ... <snip> ...
>
> Performing the steps above, I see the following traces (in the test
> environment I decreased the minimum compaction threshold to make this
> easier to reproduce). After I do a couple of update/flush cycles, I see
> this in the log:
>
> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java
> (line 117) Scheduling a background task check for bob with
> SizeTieredCompactionStrategy
>
> Then, until compaction occurs, I see (when performing a select):
>
> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java
> (line 86) Disseminating load info ...
> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227)
> execute_cql_query
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445)
> CQL statement type: SELECT
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653)
> Command/ConsistencyLevel is SliceByNamesReadCommand(table='open',
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null',
> columnName='null')', columns=[about,changed,data,])/ONE
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79)
> Blockfor is 1; setting up requests to /10.0.4.20
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669)
> reading data locally
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813)
> LocalReadRunnable reading SliceByNamesReadCommand(table='open',
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null',
> columnName='null')', columns=[about,changed,data,])
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java
> (line 68) In get top level columns: class
> org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class
> org.apache.cassandra.db.marshal.BytesType
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java
> (line 84) collectTimeOrderedData
> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java
> (line 188) tables iterated: 4 Min compact: 2
> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java
> (line 198) Apply hoisted up row mutation
> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395)
> applying mutation of row
> 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>
> The above traces will occur every time I repeat the above select
> statement.
>
> Minor compaction doesn't start until a few minutes after the request was
> submitted above (note, this is an unloaded test node):
>
> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java
> (line 45) Creating IntervalNode from
> [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>
> Once minor compaction occurs, the behavior of the write count being
> incremented stops, until more than min_compaction_threshold memtables
> are flushed to disk.
>
> So, my questions are:
>
> 1) Am I reading things correctly?
>
> 2) What is really happening here? Essentially minor compactions can
> occur between 4 and 32 memtable flushes. Looking through the code, this
> seems to only affect a couple of types of select statements (selecting a
> specific column on a specific key being one of them). During the time
> between these two values, every "select" statement will perform a write.
>
> 3) Is this desired behavior? Is there something else I should be
> looking at that could be causing this behavior?
>
> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
>
> Any help is appreciated,
>
> Thanks,
> -Mike