[jira] [Created] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Tomas Salfischberger (JIRA)
Allow summing of counter columns in CQL
---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger


CQL could be extended with a function to calculate the sum of a set of counter 
columns. Instead of transferring a long list of counter columns to be summed 
by the client, the server could calculate the total and transfer only that 
result. My proposal for the syntax (based on the COUNT() 
suggestion in the comments of CASSANDRA-1704):
{code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}

The simplest approach would be to only allow summing of counters under the same 
key, thus a query with a WHERE part that specifies multiple keys would return 1 
result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055421#comment-13055421
 ] 

Sylvain Lebresne commented on CASSANDRA-2816:
-

bq. We have also spotted very noticeable issues with full GCs when the merkle 
trees are passed around. Hopefully this could fix that too.

This does make sure that we don't run multiple validations at the same time and 
that we keep only a small number of merkle trees in memory at once. So I 
suppose this could help on the GC side. But overall I don't want to be too 
optimistic about that, in part because I'm not sure what causes your issues. 
It can't hurt on that side at least.

bq. I will see if I can get this patch tested somewhere if it is ready for that.

I believe it should be ready for that.

bq. would it be a potentially interesting idea to separate tombstones in 
different sstables.

The thing is that some tombstones may be irrelevant because some update 
supersedes them (this is especially true of row tombstones). Hence basing a 
repair on tombstones only may transfer irrelevant data. I suppose depending on 
the use case this will be more or less of a big deal. Also, reads will be 
impacted in that we will often have to hit twice as many sstables. 
Given that it's not a crazy idea either to want to repair data regularly (if 
only for durability guarantees), I don't know if it is worth the trouble (we 
would have to separate tombstones from data at flush time, maintain the two 
separate sets of data/tombstone sstables, etc...).

bq. make compaction deterministic or synchronized by a master across nodes

Pretty sure we want to avoid going to a master architecture for everything if 
we can. Having a master means that failure handling is more difficult (think 
network partitions for instance) and requires leader election and such, and the 
whole point of Cassandra being fully distributed is to avoid those. Even 
without considering those, synchronizing compaction means synchronizing flush 
somehow, and you need to be precise if you're going to use whole-sstable md5s, 
which will be hard and quite probably inefficient.

 Repair doesn't synchronize merkle tree creation properly
 

 Key: CASSANDRA-2816
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
  Labels: repair
 Fix For: 0.8.2

 Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch


 Being a little slow, I just realized after having opened CASSANDRA-2811 and 
 CASSANDRA-2815 that there is a more general problem with repair.
 When a repair is started, it will send a number of merkle tree requests to its 
 neighbors as well as to itself, and assumes for correctness that the building 
 of those trees will be started on every node at roughly the same time (if not, 
 we end up comparing data snapshots taken at different times and will thus 
 mistakenly repair a lot of useless data). This is bogus for many reasons:
 * Because validation compaction runs on the same executor as other 
 compactions, the start of the validation on the different nodes is subject to 
 other compactions. 0.8 mitigates this in a way by being multi-threaded (and 
 thus there is less chance of being blocked a long time by a long-running 
 compaction), but the compaction executor being bounded, it's still a problem.
 * If you run a nodetool repair without arguments, it will repair every CF. 
 As a consequence it will generate lots of merkle tree requests and all of 
 those requests will be issued at the same time. Because even in 0.8 the 
 compaction executor is bounded, some of those validations will end up being 
 queued behind the first ones. Even assuming that the different validations are 
 submitted in the same order on each node (which isn't guaranteed either), 
 there is no guarantee that on all nodes the first validation will take the 
 same time, hence desynchronizing the queued ones.
 Overall, it is important for the precision of repair that for a given CF and 
 range (which is the unit at which trees are computed), we make sure that all 
 nodes will start the validation at the same time (or, since we can't do magic, 
 as close as possible).
 One (reasonably simple) proposition to fix this would be to have repair 
 schedule validation compactions across nodes one by one (i.e., one CF/range at 
 a time), waiting for all nodes to return their tree before submitting the 
 next request. Then on each node, we should make sure that the node will start 
 the validation compaction as soon as requested. For that, we probably want to 
 have a specific executor for validation compaction and:
 * either we fail the 

[jira] [Updated] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2830:


Priority: Minor  (was: Major)

 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055445#comment-13055445
 ] 

Sylvain Lebresne commented on CASSANDRA-2830:
-

It's not a crazy idea. Though we should at the very least make it generic 
enough to have AVG(), MIN(), MAX() and such.

 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2521:


Attachment: 2521-v4.txt

bq. why does DT.removeOldSSTableSize acquire/release around markCompacted?

For the record, this is because it makes the logic in SSTR.markCompacted and 
SSTR.releaseReference easier. If callers of markCompacted don't acquire a 
reference, markCompacted has to deal with two cases: either no thread has a 
reference acquired, in which case the current thread should schedule the 
deletion, or other threads have a reference, in which case it should leave them 
the task of scheduling the deletion when they are done. But making this thread 
safe (so that we don't schedule twice, or forget to schedule the deletion if 
the last thread holding a reference releases it at the same time markCompacted 
is called) is a bit annoying. Acquiring a reference when markCompacted is 
called makes this easier and moves all the logic into releaseReference.
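
For readers not familiar with this reference-counting scheme, here is a 
minimal, self-contained sketch of the idea (a hypothetical class, not the 
actual SSTableReader code): markCompacted grabs a reference around the state 
change, so the "last reference released and the sstable is compacted" decision 
lives in a single place, releaseReference.
{code}
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class RefCountedSSTable {
    private final AtomicInteger references = new AtomicInteger(0);
    private final AtomicBoolean compacted = new AtomicBoolean(false);

    boolean acquireReference() {
        while (true) {
            int n = references.get();
            if (n == 0 && compacted.get())
                return false; // deletion already scheduled (or about to be)
            if (references.compareAndSet(n, n + 1))
                return true;
        }
    }

    void releaseReference() {
        // Only the thread dropping the last reference of a compacted sstable gets here.
        if (references.decrementAndGet() == 0 && compacted.get())
            scheduleDeletion();
    }

    void markCompacted() {
        acquireReference();   // guarantees at least one reference is held...
        compacted.set(true);
        releaseReference();   // ...so the deletion decision happens in releaseReference
    }

    private void scheduleDeletion() {
        System.out.println("deleting sstable files");
    }

    public static void main(String[] args) {
        RefCountedSSTable sstable = new RefCountedSSTable();
        sstable.acquireReference();  // e.g. a reader or a streaming task
        sstable.markCompacted();     // no deletion yet: a reference is still held
        sstable.releaseReference();  // last reference gone -> deletion scheduled here
    }
}
{code}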

bq. I believe currently, files are not deleted until the entire repair is 
finished

The files should get deleted as soon as they are not useful anymore, that is, 
as soon as they have been streamed. That being said, there was a bug, see below.

bq. Did another repair overnight, one minor compaction included some 20 small 
sstables, all of them remains as well as a few from other compactions and the 
files from the repairs described before

Yes, I did find a place where we were not correctly decrementing the reference 
count for streaming (repair was not unmarking sstables that were not streamed 
because they had nothing to transfer for the range). The attached v4 patch 
should fix that.

bq. As for the last version of this patch, a quick look tonight shows access 
problems with markCurrentViewReferenced()

v4 is based on v3 and fixes this (it reintroduces a specific method instead of 
making View public, because I'm not too keen on doing that, but that can change 
if someone feels strongly about it).


v4 also fixes a bug in StreamingTransferTest and another one related to null 
segments in the unmapping cleanup code.

 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Tomas Salfischberger (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055448#comment-13055448
 ] 

Tomas Salfischberger commented on CASSANDRA-2830:
-

Good point, you could have a generic function implementation that is allowed to 
do whatever it wants with an Iterator over the counter values and returns a 
single value. That would support easy implementations of SUM, MIN, MAX and AVG, 
but also things like standard deviation and variance when the need arises.
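
As an illustration of that shape, here is a small hedged sketch in Java 
(hypothetical names, not an existing Cassandra API): the aggregate only ever 
sees an Iterator over one row's counter values and folds it into a single 
number, so SUM, MIN, MAX, AVG and friends are just different implementations of 
the same interface.
{code}
import java.util.Arrays;
import java.util.Iterator;

interface CounterAggregate {
    long apply(Iterator<Long> counterValues);
}

class Sum implements CounterAggregate {
    public long apply(Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext())
            sum += values.next();
        return sum;
    }
}

class Max implements CounterAggregate {
    public long apply(Iterator<Long> values) {
        long max = Long.MIN_VALUE;
        while (values.hasNext())
            max = Math.max(max, values.next());
        return max;
    }
}

class AggregateDemo {
    public static void main(String[] args) {
        // One row's counter columns, e.g. the result of a slice columnFrom..columnTo
        Iterator<Long> row = Arrays.asList(35L, 36L, 38L).iterator();
        System.out.println(new Sum().apply(row)); // 109
    }
}
{code}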

 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

2011-06-27 Thread Terje Marthinussen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055455#comment-13055455
 ] 

Terje Marthinussen commented on CASSANDRA-2816:
---

I don't know what causes GC when doing repairs either, but fire off repair on a 
few nodes with 100 million docs/node and there is a reasonable chance that a 
node here and there will log messages about reducing cache sizes due to memory 
pressure (I am not really sure it is a good idea to do this at all, reducing 
caches during stress rarely improves anything) or full GC.

The idea of master-controlled compaction would not really be affected by 
network splits etc.

Reconciliation after a network split is just as complex with or without a 
master. We need to get back to a state where all the nodes have the same data, 
which is a complex task either way.

This is more a consideration of the fact that we do not necessarily need to 
live in a quorum-based world during compaction, and we are free to use 
alternative approaches in compaction without changing the read/write path or 
affecting availability. Master selection is not really a problem here: start 
compaction, talk to other nodes with the same token ranges, select a leader. 

It does not even have to be the same master every time, and we could consider 
making compaction part of a background read repair to reduce the number of 
times we need to read/write data. 

For instance, if we can verify that the oldest/biggest sstable is 100% in sync 
with the data on other replicas when it is compacted (why not do it during 
compaction, when we go through the data anyway, rather than later?), can we use 
that info to optimize the scans done during repairs by using only the data in 
sstables received after some checkpoint in time as the starting point 
for the consistency check?

 Repair doesn't synchronize merkle tree creation properly
 

 Key: CASSANDRA-2816
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
  Labels: repair
 Fix For: 0.8.2

 Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch


 Being a little slow, I just realized after having opened CASSANDRA-2811 and 
 CASSANDRA-2815 that there is a more general problem with repair.
 When a repair is started, it will send a number of merkle tree requests to its 
 neighbors as well as to itself, and assumes for correctness that the building 
 of those trees will be started on every node at roughly the same time (if not, 
 we end up comparing data snapshots taken at different times and will thus 
 mistakenly repair a lot of useless data). This is bogus for many reasons:
 * Because validation compaction runs on the same executor as other 
 compactions, the start of the validation on the different nodes is subject to 
 other compactions. 0.8 mitigates this in a way by being multi-threaded (and 
 thus there is less chance of being blocked a long time by a long-running 
 compaction), but the compaction executor being bounded, it's still a problem.
 * If you run a nodetool repair without arguments, it will repair every CF. 
 As a consequence it will generate lots of merkle tree requests and all of 
 those requests will be issued at the same time. Because even in 0.8 the 
 compaction executor is bounded, some of those validations will end up being 
 queued behind the first ones. Even assuming that the different validations are 
 submitted in the same order on each node (which isn't guaranteed either), 
 there is no guarantee that on all nodes the first validation will take the 
 same time, hence desynchronizing the queued ones.
 Overall, it is important for the precision of repair that for a given CF and 
 range (which is the unit at which trees are computed), we make sure that all 
 nodes will start the validation at the same time (or, since we can't do magic, 
 as close as possible).
 One (reasonably simple) proposition to fix this would be to have repair 
 schedule validation compactions across nodes one by one (i.e., one CF/range at 
 a time), waiting for all nodes to return their tree before submitting the 
 next request (a rough sketch of this one-by-one scheduling follows the 
 description). Then on each node, we should make sure that the node will start 
 the validation compaction as soon as requested. For that, we probably want to 
 have a specific executor for validation compaction and:
 * either we fail the whole repair whenever one node is not able to execute 
 the validation compaction right away (because no threads are available right 
 away),
 * or we simply tell the user that if they start too many repairs in parallel, 
 they may start seeing some of those repairing more data than they should.
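
To make the proposed one-by-one scheduling concrete, here is a minimal sketch 
of the idea only (hypothetical names and types, not the repair code itself): 
the coordinator requests the tree for a single CF/range from every endpoint, 
blocks until all of them have answered, and only then submits the next 
CF/range.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SequentialTreeRequests {
    // Request one (CF, range) validation from every endpoint, wait for all the
    // trees to come back, then move on to the next (CF, range).
    static void repair(List<String> cfRanges, List<String> endpoints) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(endpoints.size());
        try {
            for (String cfRange : cfRanges) {                   // one unit at a time
                final String unit = cfRange;
                List<Future<String>> trees = new ArrayList<Future<String>>();
                for (final String endpoint : endpoints) {
                    trees.add(pool.submit(new Callable<String>() {
                        public String call() {
                            // stand-in for "send a tree request and wait for the response"
                            return "tree(" + endpoint + ", " + unit + ")";
                        }
                    }));
                }
                for (Future<String> tree : trees)
                    System.out.println(tree.get());             // block on every node's tree
                // ...compare the trees and stream differences here, before the next unit
            }
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        repair(Arrays.asList("cf1/(0,100]", "cf2/(0,100]"),
               Arrays.asList("node1", "node2", "node3"));
    }
}
{code}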

--
This message is automatically 

[jira] [Commented] (CASSANDRA-2804) expose dropped messages, exceptions over JMX

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055470#comment-13055470
 ] 

Sylvain Lebresne commented on CASSANDRA-2804:
-

Looks good. That being said, the recent variant in JMX is essentially the 
same as the one the StatusLogger is logging, and I could see a point in 
preferring to expose this via JMX rather than in the log file. Not sure what 
the best way to accommodate those two is, however.
In any case, the patch does improve on the current situation, so +1.

 expose dropped messages, exceptions over JMX
 

 Key: CASSANDRA-2804
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2804
 Project: Cassandra
  Issue Type: Improvement
  Components: Tools
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.7.7, 0.8.2

 Attachments: 2804.txt, 
 twttr-cassandra-0.8-counts-resync-droppedmsg-metric.diff


 Patch against 0.7.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2280) Request specific column families using StreamIn

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2280:


Fix Version/s: 0.8.2
   (was: 0.8.1)

 Request specific column families using StreamIn
 ---

 Key: CASSANDRA-2280
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2280
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Stu Hood
Assignee: Jonathan Ellis
 Fix For: 0.8.2

 Attachments: 
 0001-Allow-specific-column-families-to-be-requested-for-str.txt, 
 0001-Allow-specific-column-families-to-be-requested-for-str.txt, 2280-v3.txt, 
 2280-v4.txt, 2280-v5.txt


 StreamIn.requestRanges only specifies a keyspace, meaning that requesting a 
 range will request it for all column families: if you have a large number of 
 CFs, this can cause quite a headache.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2576) Rewrite into new file post streaming

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055486#comment-13055486
 ] 

Sylvain Lebresne commented on CASSANDRA-2576:
-

Looks good, but this apparently already needs rebasing.

 Rewrite into new file post streaming
 

 Key: CASSANDRA-2576
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2576
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Stu Hood
Assignee: Stu Hood
 Fix For: 1.0

 Attachments: 
 0001-CASSANDRA-2576-Don-t-depend-on-a-byte-for-byte-match-f.txt, 
 0002-CASSANDRA-2576-Rebuild-into-a-new-file-to-minimize-mag.txt


 Commutative/counter column families use a separate path to rebuild sstables 
 post streaming, and that path currently rewrites the data within the streamed 
 file. While this is great for space efficiency, it means a duplicated code 
 path for writing sstables, which makes it more difficult to make changes like 
 #674.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2816) Repair doesn't synchronize merkle tree creation properly

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055493#comment-13055493
 ] 

Jonathan Ellis commented on CASSANDRA-2816:
---

bq. I am not really sure it is a good idea to do this at all, reducing caches 
during stress rarely improves anything

(This is on by default because the most common cause of OOMing is people 
configuring their caches too large.)

It sounds odd to me that repair would balloon memory usage dramatically.  Do 
you have monitoring graphs that show the difference in heap usage between 
normal and repair in progress?

 Repair doesn't synchronize merkle tree creation properly
 

 Key: CASSANDRA-2816
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2816
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
  Labels: repair
 Fix For: 0.8.2

 Attachments: 0001-Schedule-merkle-tree-request-one-by-one.patch


 Being a little slow, I just realized after having opened CASSANDRA-2811 and 
 CASSANDRA-2815 that there is a more general problem with repair.
 When a repair is started, it will send a number of merkle tree requests to its 
 neighbors as well as to itself, and assumes for correctness that the building 
 of those trees will be started on every node at roughly the same time (if not, 
 we end up comparing data snapshots taken at different times and will thus 
 mistakenly repair a lot of useless data). This is bogus for many reasons:
 * Because validation compaction runs on the same executor as other 
 compactions, the start of the validation on the different nodes is subject to 
 other compactions. 0.8 mitigates this in a way by being multi-threaded (and 
 thus there is less chance of being blocked a long time by a long-running 
 compaction), but the compaction executor being bounded, it's still a problem.
 * If you run a nodetool repair without arguments, it will repair every CF. 
 As a consequence it will generate lots of merkle tree requests and all of 
 those requests will be issued at the same time. Because even in 0.8 the 
 compaction executor is bounded, some of those validations will end up being 
 queued behind the first ones. Even assuming that the different validations are 
 submitted in the same order on each node (which isn't guaranteed either), 
 there is no guarantee that on all nodes the first validation will take the 
 same time, hence desynchronizing the queued ones.
 Overall, it is important for the precision of repair that for a given CF and 
 range (which is the unit at which trees are computed), we make sure that all 
 nodes will start the validation at the same time (or, since we can't do magic, 
 as close as possible).
 One (reasonably simple) proposition to fix this would be to have repair 
 schedule validation compactions across nodes one by one (i.e., one CF/range at 
 a time), waiting for all nodes to return their tree before submitting the 
 next request. Then on each node, we should make sure that the node will start 
 the validation compaction as soon as requested. For that, we probably want to 
 have a specific executor for validation compaction and:
 * either we fail the whole repair whenever one node is not able to execute 
 the validation compaction right away (because no threads are available right 
 away),
 * or we simply tell the user that if they start too many repairs in parallel, 
 they may start seeing some of those repairing more data than they should.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-1125) Filter out ColumnFamily rows that aren't part of the query

2011-06-27 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever updated CASSANDRA-1125:
-

Attachment: CASSANDRA-1125.patch

 Filter out ColumnFamily rows that aren't part of the query
 --

 Key: CASSANDRA-1125
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1125
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Reporter: Jeremy Hanna
Assignee: Mck SembWever
Priority: Minor
 Fix For: 1.0

 Attachments: CASSANDRA-1125.patch


 Currently, when running a MapReduce job against data in a Cassandra data 
 store, it reads through all the data for a particular ColumnFamily.  This 
 could be optimized to only read through those rows that have to do with the 
 query.
 It's a small change but wanted to put it in Jira so that it didn't fall 
 through the cracks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055507#comment-13055507
 ] 

Sylvain Lebresne commented on CASSANDRA-2405:
-

I'm sorry, but I still think we are returning the wrong number to the 
user. To be clear, this is nothing against the code of the patch itself; I just 
think that given the way repair works, it is not so simple to have a time 
since last successful repair.

The unit of a repair is a given keyspace, column family and range. 
Because of that, I don't think we can return a single time since last 
successful repair for a given keyspace and column family. It has to include 
the range somehow. Granted, so far a nodetool repair repairs all the ranges of 
the node you launch it on, but I don't think this should stay the case 
(CASSANDRA-2610). Moreover, even now, one of the ranges can fail without the 
others. So returning only one number for all ranges is wrong.

The other problem is: I'm not convinced that recording the information only on 
the node coordinating the repair is necessarily super helpful. When you start a 
repair on a node, you will also repair its neighbors (for only the ranges they 
share), so recording the time only on the initial node the nodetool command was 
connected to is somewhat arbitrary, and will convey the idea that repair should 
be started for every range on every node (while I strongly think that the 
short-term goal should be to make it easy to NOT do that -- CASSANDRA-2610 
again).

Imho, we should hold back on this issue for now and at least wait for 
CASSANDRA-2610, CASSANDRA-2606 and CASSANDRA-2816 before committing to 
anything. I agree that having information to help people plan repairs is nice, 
but it is at most a very minor improvement, and exposing a misleading number is 
more harmful than no number.


 should expose 'time since last successful repair' for easier aes monitoring
 ---

 Key: CASSANDRA-2405
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 0.8.2

 Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405-v3.patch, 
 CASSANDRA-2405-v4.patch, CASSANDRA-2405.patch


 The practical implementation issues of actually ensuring repair runs are 
 somewhat undocumented/untreated.
 One hopefully low hanging fruit would be to at least expose the time since 
 last successful repair for a particular column family, to make it easier to 
 write a correct script to monitor for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2405) should expose 'time since last successful repair' for easier aes monitoring

2011-06-27 Thread Pavel Yaskevich (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055511#comment-13055511
 ] 

Pavel Yaskevich commented on CASSANDRA-2405:


Sounds reasonable. Maybe we should close this as "Won't Fix" and create a 
more general issue?

 should expose 'time since last successful repair' for easier aes monitoring
 ---

 Key: CASSANDRA-2405
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2405
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Peter Schuller
Assignee: Pavel Yaskevich
Priority: Minor
 Fix For: 0.8.2

 Attachments: CASSANDRA-2405-v2.patch, CASSANDRA-2405-v3.patch, 
 CASSANDRA-2405-v4.patch, CASSANDRA-2405.patch


 The practical implementation issues of actually ensuring repair runs are 
 somewhat undocumented/untreated.
 One hopefully low hanging fruit would be to at least expose the time since 
 last successful repair for a particular column family, to make it easier to 
 write a correct script to monitor for lack of repair in a non-buggy fashion.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2831) Creating or updating CF key_validation_class with the CLI doesn't work

2011-06-27 Thread Silvère Lestang (JIRA)
Creating or updating CF key_validation_class with the CLI doesn't work
---

 Key: CASSANDRA-2831
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2831
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.2
 Environment: Ubuntu 10.10, 32 bits
java version 1.6.0_24
Brisk beta-2 installed from Debian packages
Reporter: Silvère Lestang


In the command line:
{code}
create column family test with key_validation_class = 'AsciiType' and 
comparator = 'LongType' and default_validation_class = 'IntegerType';
describe keyspace;
Keyspace: Test:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
Options: [replication_factor:1]
  Column Families:
ColumnFamily: test
  Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
  Default column value validator: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.LongType
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.571875/122/1440 (millions of ops/MB/minutes)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: false
  Built indexes: []
{code}
The Default column value validator is BytesType instead of IntegerType. Also 
tested with other types and with the update column family command; the same 
problem occurs.

{code}
[default@Test] update column family test with default_validation_class = 
'LongType';
51a37430-a0bb-11e0--ef8993101fdf
Waiting for schema agreement...
... schemas agree across the cluster
[default@Test] describe keyspace;   

Keyspace: Test:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
Options: [replication_factor:1]
  Column Families:
ColumnFamily: test
  Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
  Default column value validator: org.apache.cassandra.db.marshal.BytesType
  Columns sorted by: org.apache.cassandra.db.marshal.LongType
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.571875/122/1440 (millions of ops/MB/minutes)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: false
  Built indexes: []
{code}

Btw, there is a typo in file 
src/resources/org/apache/cassandra/cli/CliHelp.yaml line 642: 
key_valiation_class should be key_validation_class. 
Very annoying for people like me who stupidly copy/paste the help.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2810) RuntimeException in Pig when using dump command on column name

2011-06-27 Thread Silvère Lestang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055521#comment-13055521
 ] 

Silvère Lestang commented on CASSANDRA-2810:


I tried again after applying [^2810.txt] and the patch from [CASSANDRA-2777], 
and the bug is still here.
With the patch, you need to replace
{code}
test = LOAD 'cassandra://Test/test' USING CassandraStorage() AS 
(rowkey:chararray, columns: bag {T: (name:long, value:int)});
{code}
by
{code}
test = LOAD 'cassandra://Test/test' USING CassandraStorage() AS ();
{code}
because CassandraStorage takes care of the schema.

I tried:
{code}
grunt> describe test;
test: {key: chararray,columns: {(name: long,value: int)}}
{code}
so we can see that the patch from bug 2777 works correctly (I also tested with 
different types for value).
But when I dump test, I still have the same exception.

 RuntimeException in Pig when using dump command on column name
 

 Key: CASSANDRA-2810
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2810
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: Ubuntu 10.10, 32 bits
 java version 1.6.0_24
 Brisk beta-2 installed from Debian packages
Reporter: Silvère Lestang
Assignee: Brandon Williams
 Attachments: 2810.txt


 This bug was previously report on [Brisk bug 
 tracker|https://datastax.jira.com/browse/BRISK-232].
 In cassandra-cli:
 {code}
 [default@unknown] create keyspace Test
 with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
 and strategy_options = [{replication_factor:1}];
 [default@unknown] use Test;
 Authenticated to keyspace: Test
 [default@Test] create column family test;
 [default@Test] set test[ascii('row1')][long(1)]=integer(35);
 set test[ascii('row1')][long(2)]=integer(36);
 set test[ascii('row1')][long(3)]=integer(38);
 set test[ascii('row2')][long(1)]=integer(45);
 set test[ascii('row2')][long(2)]=integer(42);
 set test[ascii('row2')][long(3)]=integer(33);
 [default@Test] list test;
 Using default limit of 100
 ---
 RowKey: 726f7731
 => (column=0001, value=35, timestamp=1308744931122000)
 => (column=0002, value=36, timestamp=1308744931124000)
 => (column=0003, value=38, timestamp=1308744931125000)
 ---
 RowKey: 726f7732
 => (column=0001, value=45, timestamp=1308744931127000)
 => (column=0002, value=42, timestamp=1308744931128000)
 => (column=0003, value=33, timestamp=1308744932722000)
 2 Rows Returned.
 [default@Test] describe keyspace;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 ColumnFamily: test
   Key Validation Class: org.apache.cassandra.db.marshal.BytesType
   Default column value validator: 
 org.apache.cassandra.db.marshal.BytesType
   Columns sorted by: org.apache.cassandra.db.marshal.BytesType
   Row cache size / save period in seconds: 0.0/0
   Key cache size / save period in seconds: 20.0/14400
   Memtable thresholds: 0.571875/122/1440 (millions of ops/MB/minutes)
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: false
   Built indexes: []
 {code}
 In Pig command line:
 {code}
 grunt> test = LOAD 'cassandra://Test/test' USING CassandraStorage() AS 
 (rowkey:chararray, columns: bag {T: (name:long, value:int)});
 grunt> value_test = foreach test generate rowkey, columns.name, columns.value;
 grunt> dump value_test;
 {code}
 In /var/log/cassandra/system.log, I see this exception several times:
 {code}
 INFO [IPC Server handler 3 on 8012] 2011-06-22 15:03:28,533 
 TaskInProgress.java (line 551) Error from 
 attempt_201106210955_0051_m_00_3: java.lang.RuntimeException: Unexpected 
 data type -1 found in stream.
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:478)
   at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:541)
   at org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:522)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:361)
   at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:541)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357)
   at 
 org.apache.pig.impl.io.InterRecordWriter.write(InterRecordWriter.java:73)
   at org.apache.pig.impl.io.InterStorage.putNext(InterStorage.java:87)
   at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:138)
   at 
 

[jira] [Resolved] (CASSANDRA-2831) Creating or updating CF key_validation_class with the CLI doesn't work

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis resolved CASSANDRA-2831.
---

   Resolution: Fixed
Fix Version/s: 0.8.1

fixed by r1137774

 Creating or updating CF key_validation_class with the CLI doesn't work
 ---

 Key: CASSANDRA-2831
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2831
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.2
 Environment: Ubuntu 10.10, 32 bits
 java version 1.6.0_24
 Brisk beta-2 installed from Debian packages
Reporter: Silvère Lestang
 Fix For: 0.8.1


 In the command line:
 {code}
 create column family test with key_validation_class = 'AsciiType' and 
 comparator = 'LongType' and default_validation_class = 'IntegerType';
 describe keyspace;
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 ColumnFamily: test
   Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
   Default column value validator: 
 org.apache.cassandra.db.marshal.BytesType
   Columns sorted by: org.apache.cassandra.db.marshal.LongType
   Row cache size / save period in seconds: 0.0/0
   Key cache size / save period in seconds: 20.0/14400
   Memtable thresholds: 0.571875/122/1440 (millions of ops/MB/minutes)
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: false
   Built indexes: []
 {code}
 The Default column value validator is BytesType instead of IntegerType. 
 Also tested with other types and with the update column family command; the 
 same problem occurs.
 {code}
 [default@Test] update column family test with default_validation_class = 
 'LongType';
 51a37430-a0bb-11e0--ef8993101fdf
 Waiting for schema agreement...
 ... schemas agree across the cluster
 [default@Test] describe keyspace; 
   
 Keyspace: Test:
   Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
   Durable Writes: true
 Options: [replication_factor:1]
   Column Families:
 ColumnFamily: test
   Key Validation Class: org.apache.cassandra.db.marshal.AsciiType
   Default column value validator: 
 org.apache.cassandra.db.marshal.BytesType
   Columns sorted by: org.apache.cassandra.db.marshal.LongType
   Row cache size / save period in seconds: 0.0/0
   Key cache size / save period in seconds: 20.0/14400
   Memtable thresholds: 0.571875/122/1440 (millions of ops/MB/minutes)
   GC grace seconds: 864000
   Compaction min/max thresholds: 4/32
   Read repair chance: 1.0
   Replicate on write: false
   Built indexes: []
 {code}
 Btw, there is a typo in file 
 src/resources/org/apache/cassandra/cli/CliHelp.yaml line 642: 
 key_valiation_class should be key_validation_class. 
 Very annoying for people like me who stupidly copy/paste the help.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2823:


Attachment: 2823.patch

I think the problem is with the call to removeDeleted in resolveSuperset() 
(which is fairly new). Basically, the code is fine with resolved being null as 
long as this means that all the versions are null. But the removeDeleted call 
makes it possible to end up with a null resolved even if the versions are not 
null, if a row tombstone expires between the time it was returned by the node 
and the time it is resolved by the coordinator, for instance.

Attaching a patch that skips the maybeScheduleRepairs() call if resolved == 
null, since in that case there is nothing to repair anyway, the tombstones 
being expired.
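
A self-contained sketch of the shape of that fix (stand-in names, not the 
actual RangeSliceResponseResolver code): when merging the versions yields null 
because everything left was an expired tombstone, repair scheduling is skipped 
and an empty row is returned.
{code}
import java.util.Arrays;
import java.util.List;

class NullResolvedGuard {
    static String getReduced(String key, List<String> versions) {
        String resolved = versions.size() > 1
                        ? resolveSuperset(versions)  // may be null once expired tombstones are purged
                        : versions.get(0);
        if (resolved != null)
            maybeScheduleRepairs(key, resolved, versions);  // only repair when something is left
        return "Row(" + key + ", " + resolved + ")";
    }

    static String resolveSuperset(List<String> versions) {
        // stand-in: pretend every version collapsed to an expired tombstone
        return null;
    }

    static void maybeScheduleRepairs(String key, String resolved, List<String> versions) {
        System.out.println("scheduling read repair for " + key);
    }

    public static void main(String[] args) {
        // prints "Row(row1, null)" and schedules no repair
        System.out.println(getReduced("row1", Arrays.asList("v1", "v2")));
    }
}
{code}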

 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it.
 This triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 It seems like resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that 
 versions.add(current.left.cf);
 can never add a null (i.e., that current.left.cf is never null)?
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
   return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (CASSANDRA-2811) Repair doesn't stagger flushes

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne resolved CASSANDRA-2811.
-

Resolution: Duplicate

Marking this a duplicate of CASSANDRA-2816, as the patch on the latter includes 
this.

 Repair doesn't stagger flushes
 --

 Key: CASSANDRA-2811
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2811
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
 Fix For: 0.8.2


 When you do a nodetool repair (with no options), the following things occur:
 * For each keyspace, a call to SS.forceTableRepair is issued
 * In each of those calls: for each token range the node is responsible for, a 
 repair session is created and started
 * Each of these sessions will request one merkle tree per column family (from 
 each node for which it makes sense, which includes the node the repair is 
 started on)
 All those merkle tree requests are done basically at the same time. And now 
 that compaction is multi-threaded, this means that usually more than one 
 validation compaction will be started at the same time. The problem is that a 
 validation compaction starts with a flush. Given that by default the 
 flush_queue_size is 4 and the number of compaction threads is the number of 
 processors, and given that on any recent machine the number of cores will be 
 >= 4, this means that this will easily end up blocking writes for some period 
 of time.
 It turns out to also have a more subtle problem for repair itself. If two 
 validation compactions for the same column family (but different ranges) are 
 started in a very short time interval, the first validation will block on the 
 flush, but the second one may not block at all if the memtable is clean when 
 it requests its own flush. In that case the second validation will be 
 executed on data older than it should be.
 I think the simpler fix is to make sure we only ever do one validation 
 compaction at a time. It's probably a better use of resources anyway. 
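
A minimal sketch of what "only one validation compaction at a time" could look 
like (hypothetical names, not the actual CompactionManager): all validations 
are funneled through a dedicated single-threaded executor, so they serialize 
instead of competing with, or being queued behind, regular compactions.
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ValidationExecutor {
    private static final ExecutorService VALIDATION =
            Executors.newSingleThreadExecutor();   // at most one validation runs at once

    static Future<?> submitValidation(final String columnFamily, final String range) {
        return VALIDATION.submit(new Runnable() {
            public void run() {
                // stand-in for: flush, then build the merkle tree for (columnFamily, range)
                System.out.println("validating " + columnFamily + " " + range);
            }
        });
    }

    public static void main(String[] args) throws Exception {
        submitValidation("cf1", "(0,100]");
        submitValidation("cf2", "(0,100]").get();  // the second waits behind the first
        VALIDATION.shutdown();
    }
}
{code}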

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Resolved] (CASSANDRA-2815) Bad timing in repair can transfer data it is not supposed to

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne resolved CASSANDRA-2815.
-

Resolution: Duplicate

Marking as duplicate of CASSANDRA-2816 as the patch there fixes that issue too.

 Bad timing in repair can transfer data it is not supposed to 
 

 Key: CASSANDRA-2815
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
  Labels: repair
 Fix For: 0.8.2


 The core of the problem is that the sstables used to construct a merkle tree 
 are not necessarily the same as the ones for which streaming is initiated. 
 This is usually not a big deal: newly compacted sstables don't matter since 
 the data hasn't changed. Newly flushed data (between the start of the 
 validation compaction and the start of the streaming) is a little bit more 
 problematic, even though one could argue that on average this won't be too 
 much data.
 But there can be a more problematic scenario: suppose a 3-node cluster with 
 RF=3, all the data on node3 is removed and then repair is started on node3.  
 Also suppose the cluster has two CFs, cf1 and cf2, sharing the data evenly.
 Node3 will request the usual merkle trees; let's pretend validation 
 compaction is single-threaded to simplify.
 What can happen is the following:
   # node3 computes its merkle trees for all requests very quickly, having no 
 data. It will thus wait on the other nodes' trees (its own trees saying "I 
 have no data").
   # node1 starts computing its tree for, say, cf1 (queuing the computation for 
 cf2). In the meantime, node2 may start computing the tree for cf2 (queuing 
 the one for cf1).
   # when node1 completes its first tree, it sends it to node3. Node3 receives 
 it, compares it to its own tree and initiates transfer of all the data for 
 cf1 from node1 to itself.
   # not too long after that, node2 completes its first tree, the one for cf2, 
 and sends it to node3. Based on it, transfer of all the data for cf2 from 
 node2 to node3 starts.
   # an arbitrarily long time after that (validation compaction can take time), 
 node2 will finish its second tree (for cf1, that is) and send it back to 
 node3. Node3 will compare it to its own (empty) tree and decide that all the 
 ranges should be repaired with node2 for cf1. The problem is that when that 
 happens, the transfer of cf1 from node1 may have been done already, or at 
 least partly done. For that reason, node3 will start streaming all this data 
 to node2.
   # for the same reasons, node3 may end up transferring all or part of cf2 to 
 node1.
 So the problem is, even though at the beginning node1 and node2 may be 
 perfectly consistent, we will end up streaming a potentially huge amount of 
 data to them.
 I think this affects both 0.7 and 0.8, though differently, because compaction 
 multi-threading and the fact that 0.8 differentiates the ranges make the 
 timing different.
 One solution (in a way, the theoretically right solution) would be to grab 
 references to the sstables we use for the validation compaction, and only 
 initiate streaming on these sstables. However, in 0.8 where compaction is 
 multi-threaded, this would mean retaining compacted sstables longer than 
 necessary. This is also a bit complicated and in particular would require a 
 network protocol change (because a streaming request message would have to 
 contain some information allowing us to decide which set of sstables to use).
 A maybe simpler solution could be to have the node coordinating the repair 
 wait for the trees of all the remotes (obviously only for a given 
 column family and range) before starting streaming.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2824) assert err on SystemTable.getCurrentLocalNodeId during a cleanup

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2824:


Attachment: 2824.patch

There is possibly a race here that triggers the assert. The code is relying on 
the fact that system tables have a gc_grace of 0 to assume it cannot get 
tombstones back, but given that gcbefore has a precision of one second, and 
given that on a tie with the tombstone timestamp we do include the tombstone, 
we could get tombstones back.

Attaching a patch that takes a safer road.
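
A hedged sketch of the safer road in spirit only (stand-in types, not the 
actual SystemTable patch): rather than assuming gc_grace = 0 already purged 
every tombstone from the query result, explicitly drop anything still marked 
deleted before asserting on what remains.
{code}
import java.util.ArrayList;
import java.util.List;

class TombstoneFilter {
    static class Column {
        final String name;
        final boolean deleted;
        Column(String name, boolean deleted) { this.name = name; this.deleted = deleted; }
        boolean isMarkedForDelete() { return deleted; }
    }

    // Keep only live columns; callers can then safely assert on the result.
    static List<Column> liveColumns(List<Column> columns) {
        List<Column> live = new ArrayList<Column>();
        for (Column c : columns)
            if (!c.isMarkedForDelete())  // a same-second tombstone would otherwise slip through
                live.add(c);
        return live;
    }

    public static void main(String[] args) {
        List<Column> columns = new ArrayList<Column>();
        columns.add(new Column("CurrentLocal", false));
        columns.add(new Column("OldLocal", true));       // tombstone that slipped past gcBefore
        System.out.println(liveColumns(columns).size()); // 1
    }
}
{code}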

 assert err on SystemTable.getCurrentLocalNodeId during a cleanup
 

 Key: CASSANDRA-2824
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2824
 Project: Cassandra
  Issue Type: Bug
Reporter: Jackson Chung
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2824.patch


 when running nodetool cleanup the following happened:
 $ ./bin/nodetool cleanup --host localhost
 Exception in thread "main" java.lang.AssertionError
 at 
 org.apache.cassandra.db.SystemTable.getCurrentLocalNodeId(SystemTable.java:383)
 at 
 org.apache.cassandra.utils.NodeId$LocalNodeIdHistory.<init>(NodeId.java:179)
 at org.apache.cassandra.utils.NodeId.<clinit>(NodeId.java:38)
 at org.apache.cassandra.utils.NodeId$OneShotRenewer.<init>(NodeId.java:159)
 at 
 org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:1317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
 at sun.rmi.transport.Transport$1.run(Transport.java:159)
 at java.security.AccessController.doPrivileged(Native Method)
 at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
 at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662) 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2810) RuntimeException in Pig when using dump command on column name

2011-06-27 Thread Silvère Lestang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055585#comment-13055585
 ] 

Silvère Lestang commented on CASSANDRA-2810:


After more tests (with both patches), patch [^2810.txt] doesn't seem to solve 
the bug.
Here is a new test case:
Create a _Test_ keyspace and a _test_ column family with key_validation_class = 
'AsciiType' and comparator = 'LongType' and default_validation_class = 
'IntegerType' (don't use the cli because of [#CASSANDRA-2831]).
Insert some data:
{code}
set test[ascii('row1')][long(1)]=integer(35);
set test[ascii('row1')][long(2)]=integer(36);
set test[ascii('row1')][long(3)]=integer(38);
set test[ascii('row2')][long(1)]=integer(45);
set test[ascii('row2')][long(2)]=integer(42);
set test[ascii('row2')][long(3)]=integer(33);
{code}

In Pig cli:
{code}
test = LOAD 'cassandra://Test/test' USING CassandraStorage() AS ();
dump test;
{code}
The same exception as before is raised:
{code}
 INFO [IPC Server handler 4 on 8012] 2011-06-27 16:40:28,562 
TaskInProgress.java (line 551) Error from attempt_201106271436_0012_m_00_1: 
java.lang.RuntimeException: Unexpected data type -1 found in stream.
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:478)
at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:541)
at org.apache.pig.data.BinInterSedes.writeBag(BinInterSedes.java:522)
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:361)
at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:541)
at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:357)
at 
org.apache.pig.impl.io.InterRecordWriter.write(InterRecordWriter.java:73)
at org.apache.pig.impl.io.InterStorage.putNext(InterStorage.java:87)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:138)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:97)
at 
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:638)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:224)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)

{code}

 RuntimeException in Pig when using dump command on column name
 

 Key: CASSANDRA-2810
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2810
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.1
 Environment: Ubuntu 10.10, 32 bits
 java version 1.6.0_24
 Brisk beta-2 installed from Debian packages
Reporter: Silvère Lestang
Assignee: Brandon Williams
 Attachments: 2810.txt


 This bug was previously reported on [Brisk bug 
 tracker|https://datastax.jira.com/browse/BRISK-232].
 In cassandra-cli:
 {code}
 [default@unknown] create keyspace Test
 with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
 and strategy_options = [{replication_factor:1}];
 [default@unknown] use Test;
 Authenticated to keyspace: Test
 [default@Test] create column family test;
 [default@Test] set test[ascii('row1')][long(1)]=integer(35);
 set test[ascii('row1')][long(2)]=integer(36);
 set test[ascii('row1')][long(3)]=integer(38);
 set test[ascii('row2')][long(1)]=integer(45);
 set test[ascii('row2')][long(2)]=integer(42);
 set test[ascii('row2')][long(3)]=integer(33);
 [default@Test] list test;
 Using default limit of 100
 ---
 RowKey: 726f7731
 = (column=0001, value=35, timestamp=1308744931122000)
 = (column=0002, value=36, timestamp=1308744931124000)
 = (column=0003, value=38, timestamp=1308744931125000)
 ---
 RowKey: 726f7732
 = (column=0001, value=45, timestamp=1308744931127000)
 = (column=0002, value=42, 

[jira] [Commented] (CASSANDRA-2824) assert err on SystemTable.getCurrentLocalNodeId during a cleanup

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055590#comment-13055590
 ] 

Jonathan Ellis commented on CASSANDRA-2824:
---

why not just call removeDeleted?

 assert err on SystemTable.getCurrentLocalNodeId during a cleanup
 

 Key: CASSANDRA-2824
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2824
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Jackson Chung
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2824.patch


 when running nodetool cleanup the following happened:
 $ ./bin/nodetool cleanup --host localhost
 Exception in thread "main" java.lang.AssertionError
 at 
 org.apache.cassandra.db.SystemTable.getCurrentLocalNodeId(SystemTable.java:383)
 at 
 org.apache.cassandra.utils.NodeId$LocalNodeIdHistory.init(NodeId.java:179)
 at org.apache.cassandra.utils.NodeId.clinit(NodeId.java:38)
 at org.apache.cassandra.utils.NodeId$OneShotRenewer.init(NodeId.java:159)
 at 
 org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:1317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
 at sun.rmi.transport.Transport$1.run(Transport.java:159)
 at java.security.AccessController.doPrivileged(Native Method)
 at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
 at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662) 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055591#comment-13055591
 ] 

Jonathan Ellis commented on CASSANDRA-2823:
---

+1

 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it
 triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 it seems that resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that the 
 current.left.cf passed to versions.add(current.left.cf); can never be null.
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
     return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055593#comment-13055593
 ] 

Jonathan Ellis commented on CASSANDRA-2823:
---

although I slightly prefer the if == null return version immediately after 
initializing resolved to keep those two pieces of logic together.

 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it
 triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 it seems that resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that the 
 current.left.cf passed to versions.add(current.left.cf); can never be null.
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
     return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055593#comment-13055593
 ] 

Jonathan Ellis edited comment on CASSANDRA-2823 at 6/27/11 3:06 PM:


although I slightly prefer the if == null return immediately after 
initializing resolved, to keep those two pieces of logic together.

  was (Author: jbellis):
although I slightly prefer the if == null return version immediately 
after initializing resolved to keep those two pieces of logic together.
  
 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it
 triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 it seems that resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that the 
 current.left.cf passed to versions.add(current.left.cf); can never be null.
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
     return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055596#comment-13055596
 ] 

Sylvain Lebresne commented on CASSANDRA-2823:
-

Yeah, I didn't do that mostly because there are still a few lines of code 
(besides maybe scheduling repairs) that we need to execute even if resolved is null 
(the debugging message in RowRepairResolver and, more importantly, the clearing of 
versions and versionSources in RangeSliceResolver). 
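
To make the shape of the fix concrete, here is a minimal sketch (not the actual 2823.patch) of a getReduced() that guards against a null superset while still clearing the per-row state mentioned above. The field names come from the stack trace and discussion; the exact maybeScheduleRepairs() parameters are assumptions.
{code}
protected Row getReduced()
{
    ColumnFamily resolved = versions.size() > 1
                          ? RowRepairResolver.resolveSuperset(versions)
                          : versions.get(0);

    if (resolved != null)
        RowRepairResolver.maybeScheduleRepairs(resolved, table, key, versions, versionSources);

    // must happen even when resolved is null, otherwise the next key
    // reuses stale versions/sources
    versions.clear();
    versionSources.clear();

    return new Row(key, resolved);
}
{code}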

 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it
 triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 it seems that resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that the 
 current.left.cf passed to versions.add(current.left.cf); can never be null.
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
     return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2823) NPE during range slices with rowrepairs

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055597#comment-13055597
 ] 

Jonathan Ellis commented on CASSANDRA-2823:
---

ah, right -- skipping the clear would be buggy.  +1 again. :)

 NPE during range slices with rowrepairs
 ---

 Key: CASSANDRA-2823
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2823
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.2
 Environment: This is a trunk build with 2521 and 2433
 I somewhat doubt that is related however.
Reporter: Terje Marthinussen
Assignee: Sylvain Lebresne
 Attachments: 2823.patch


 Doing some heavy testing of relatively fast feeding (5000+ mutations/sec) + 
 repair on all nodes + range slices.
 Then occasionally killing a node here and there and restarting it
 triggers the following NPE:
  ERROR [pool-2-thread-3] 2011-06-24 20:56:27,289 Cassandra.java (line 3210) 
 Internal error processing get_range_slices
 java.lang.NullPointerException
   at 
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:109)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:112)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver$2.getReduced(RangeSliceResponseResolver.java:83)
   at 
 org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:161)
   at 
 org.apache.cassandra.utils.MergeIterator.computeNext(MergeIterator.java:88)
   at 
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
   at 
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:120)
   at 
 org.apache.cassandra.service.RangeSliceResponseResolver.resolve(RangeSliceResponseResolver.java:43)
 Looking at the code in getReduced:
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 {noformat}
 it seems that resolved becomes null when this happens and versions.size() is 
 larger than 1.
 RowRepairResolver.resolveSuperset() does actually return null if it cannot 
 resolve anything, so there is definitely a case here which can occur and is 
 not handled.
 It may also be an interesting question whether it is guaranteed that the 
 current.left.cf passed to versions.add(current.left.cf); can never be null.
 Jonathan suggested on IRC that maybe 
 {noformat}
 ColumnFamily resolved = versions.size() > 1
                       ? RowRepairResolver.resolveSuperset(versions)
                       : versions.get(0);
 if (resolved == null)
     return new Row(key, resolved);
 {noformat}
 could be a fix.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2820) Re-introduce FastByteArrayInputStream (and Output equivalent)

2011-06-27 Thread Paul Loy (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Loy updated CASSANDRA-2820:


Attachment: fast_bytearray_iostreams_harmony-patch-3.txt

This patch has:

 * FastByteArrayIn/OutputStream impls that extend the base Harmony versions.
 * Same formatting of the Fast versions and the Harmony version for easy 
diffing if Harmony bug fixes need to be applied.
 * A full carbon-copy of Harmony code to ensure cross-JRE support.
 * Other import changes (eclipse CTRL+SHIFT+Os) reverted.

 Re-introduce FastByteArrayInputStream (and Output equivalent)
 -

 Key: CASSANDRA-2820
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2820
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Affects Versions: 0.8.0
 Environment: n/a
Reporter: Paul Loy
Priority: Minor
  Labels: bytearrayinputstream, bytearrayoutputstream, license, 
 synchronized
 Attachments: fast_bytearray_iostreams_harmony-patch-2.txt, 
 fast_bytearray_iostreams_harmony-patch-3.txt


 In https://issues.apache.org/jira/browse/CASSANDRA-37 
 FastByteArrayInputStream and FastByteArrayOutputStream were removed due to 
 being code copied from the JDK and then subsequently modified. The JDK 
 license is incompatible with Apache 2 license so the code had to go.
 I have since had a look at the performance of the JDK ByteArrayInputStream 
 and a FastByteArrayInputStream (i.e. one with synchronized methods made 
 un-synchronized) and seen the difference is significant.
 After a warmup-period of 1 loops I get the following for 1 loops 
 through a 128000 byte array:
 bais : 3513ms
 fbais: 72ms
 This varies depending on the OS, machine and Java version, but it's always in 
 favour of the FastByteArrayInputStream as you might expect.
 Then, at Jonathan Ellis' suggestion, I tried this using a modified Apache 
 Harmony ByteArrayInputStream - i.e. one whose license is compatible - and the 
 results were the same. A significant boost.
 I will attach a patch with changes for the 0.8.0 tag.
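
As a rough illustration of the measurement described above (this is not the attached patch or its benchmark; the loop count and class name below are made up), a self-contained comparison of the synchronized JDK stream against an unsynchronized per-byte loop could look like:
{code}
import java.io.ByteArrayInputStream;

public class BaisBench
{
    static final byte[] DATA = new byte[128000];

    public static void main(String[] args)
    {
        long sum = 0;

        long t0 = System.nanoTime();
        for (int i = 0; i < 10000; i++)
        {
            // JDK stream: every read() acquires the monitor because the methods are synchronized
            ByteArrayInputStream bais = new ByteArrayInputStream(DATA);
            int b;
            while ((b = bais.read()) != -1)
                sum += b;
        }
        long t1 = System.nanoTime();

        for (int i = 0; i < 10000; i++)
        {
            // unsynchronized equivalent of the same per-byte loop
            int pos = 0;
            while (pos < DATA.length)
                sum += DATA[pos++] & 0xff;
        }
        long t2 = System.nanoTime();

        System.out.println("bais : " + (t1 - t0) / 1000000 + "ms");
        System.out.println("fbais: " + (t2 - t1) / 1000000 + "ms (checksum " + sum + ")");
    }
}
{code}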

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2773) Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055611#comment-13055611
 ] 

Sylvain Lebresne commented on CASSANDRA-2773:
-

Actually, there isn't really much risk of data loss given that, as Jonathan 
said, if you hit that, it's fairly easy to go back to 0.7.5, fix client code 
and upgrade again. Granted, this is not user friendly and not something you 
should expect from a minor upgrade, but let's at least set the record straight on 
the data loss part.

That being said, I don't think the patch on this ticket could screw up indexes 
more than we used to prior to 0.7.6, so maybe we can commit it to 0.7.7 on that 
ground.

I'd still suggest fixing client code in the meantime. 

 Index manager cannot support deleting and inserting into a row in the same 
 mutation
 -

 Key: CASSANDRA-2773
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2773
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Boris Yen
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 0.8.2

 Attachments: 2773-v2.txt, 2773.txt, cassandra.log


 I use hector 0.8.0-1 and cassandra 0.8.
 1. create mutator by using hector api, 
 2. Insert a few columns into the mutator for key key1, cf standard. 
 3. add a deletion to the mutator to delete the record of key1, cf 
 standard.
 4. repeat 2 and 3
 5. execute the mutator.
 The result: the connection seems to be held by the server forever; it never 
 returns. When I tried to restart Cassandra I saw an UnsupportedOperationException: 
 Index manager cannot support deleting and inserting into a row in the same 
 mutation. And Cassandra is dead forever, unless I delete the commitlog. 
 I would expect to get an exception when I execute the mutator, not after I 
 restart Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2773) Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055615#comment-13055615
 ] 

Jonathan Ellis commented on CASSANDRA-2773:
---

bq. I don't think the patch on this ticket could screw up indexes more than we 
used to prior to 0.7.6

That's a valid way to frame the issue.

I'm good to commit for 0.7.7 if Jim can test the patch first, since he's the 
only one we've heard of hitting this in 0.7.x.  (Specifically, we want to make 
sure that if we query WHERE foo = X we don't get results back where foo is 
something other than X.  Ideally you'd start with an empty database, or at 
least drop + recreate indexes first to make sure the results aren't 
contaminated w/ corrupt entries from pre-0.7.6.)

 Index manager cannot support deleting and inserting into a row in the same 
 mutation
 -

 Key: CASSANDRA-2773
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2773
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Boris Yen
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 0.8.2

 Attachments: 2773-v2.txt, 2773.txt, cassandra.log


 I use hector 0.8.0-1 and cassandra 0.8.
 1. create mutator by using hector api, 
 2. Insert a few columns into the mutator for key key1, cf standard. 
 3. add a deletion to the mutator to delete the record of key1, cf 
 standard.
 4. repeat 2 and 3
 5. execute the mutator.
 The result: the connection seems to be held by the server forever; it never 
 returns. When I tried to restart Cassandra I saw an UnsupportedOperationException: 
 Index manager cannot support deleting and inserting into a row in the same 
 mutation. And Cassandra is dead forever, unless I delete the commitlog. 
 I would expect to get an exception when I execute the mutator, not after I 
 restart Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-1608) Redesigned Compaction

2011-06-27 Thread Benjamin Coverston (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Coverston updated CASSANDRA-1608:
--

Attachment: 1608-v7.txt

Added an interval tree to cull sstables that are not needed for point and range 
queries.
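
For readers following along, the culling idea itself is simple; the sketch below is not the attached 1608-v7.txt (the patch uses an interval tree so the lookup is logarithmic, and the class and field names here are hypothetical), it only illustrates the overlap test that lets point and range queries skip sstables whose key range cannot match.
{code}
import java.util.ArrayList;
import java.util.List;

class SSTableRange
{
    final String name;       // which sstable this range describes
    final long firstToken;   // smallest token covered by the sstable
    final long lastToken;    // largest token covered by the sstable

    SSTableRange(String name, long firstToken, long lastToken)
    {
        this.name = name;
        this.firstToken = firstToken;
        this.lastToken = lastToken;
    }

    // keep only sstables whose [firstToken, lastToken] intersects [start, end]
    static List<SSTableRange> candidatesFor(List<SSTableRange> sstables, long start, long end)
    {
        List<SSTableRange> hits = new ArrayList<SSTableRange>();
        for (SSTableRange s : sstables)
            if (s.lastToken >= start && s.firstToken <= end)
                hits.add(s);
        return hits;
    }
}
{code}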

 Redesigned Compaction
 -

 Key: CASSANDRA-1608
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1608
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Benjamin Coverston
 Attachments: 0001-leveldb-style-compaction.patch, 1608-v2.txt, 
 1608-v3.txt, 1608-v4.txt, 1608-v5.txt, 1608-v7.txt


 After seeing the I/O issues in CASSANDRA-1470, I've been doing some more 
 thinking on this subject that I wanted to lay out.
 I propose we redo the concept of how compaction works in Cassandra. At the 
 moment, compaction is kicked off based on a write access pattern, not read 
 access pattern. In most cases, you want the opposite. You want to be able to 
 track how well each SSTable is performing in the system. If we were to keep 
 statistics in-memory of each SSTable, prioritize them based on most accessed, 
 and bloom filter hit/miss ratios, we could intelligently group sstables that 
 are being read most often and schedule them for compaction. We could also 
 schedule lower priority maintenance on SSTable's not often accessed.
 I also propose we limit the size of each SSTable to a fixed size, which gives 
 us the ability to better utilize our bloom filters in a predictable manner. 
 At the moment after a certain size, the bloom filters become less reliable. 
 This would also allow us to group data most accessed. Currently the size of 
 an SSTable can grow to a point where large portions of the data might not 
 actually be accessed as often.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Cassandra Wiki] Trivial Update of Counters by SylvainLebresne

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The Counters page has been changed by SylvainLebresne:
http://wiki.apache.org/cassandra/Counters?action=diff&rev1=12&rev2=13

  == Technical limitations ==
  
* If a write fails unexpectedly (timeout or loss of connection to the 
coordinator node) the client will not know if the operation has been performed. 
A retry can result in an over count 
[[https://issues.apache.org/jira/browse/CASSANDRA-2495|CASSANDRA-2495]].
-   * Counter removal is intrinsically limited. For instance, if you issue very 
quickly the sequence increment, remove, increment it is possible for the 
removal to be lost (if for some reason the remove happens to be the last 
received messages). Hence, removal of counters is provided for definitive 
removal, that is when the deleted counter is not increment afterwards. Note 
that if you need to reset a counter, you can read its ''value'' and insert 
''-value''. 
+   * Counter removal is intrinsically limited. For instance, if you issue very 
quickly the sequence increment, remove, increment it is possible for the 
removal to be lost (if for some reason the remove happens to be the last 
received messages). Hence, removal of counters is provided for definitive 
removal only, that is when the deleted counter is not increment afterwards. 
Note that if you need to reset a counter, you can read its ''value'' and insert 
''-value''. 
  
  == Further reading ==
- See [[https://issues.apache.org/jira/browse/CASSANDRA-1072|CASSANDRA-1072]] 
and especially the 
[[https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf|design
 doc]] for further information about how this works internally.
+ See [[https://issues.apache.org/jira/browse/CASSANDRA-1072|CASSANDRA-1072]] 
and especially the 
[[https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf|design
 doc]] for further information about how this works internally (but note that 
some of the limitations described in these technical documents have been fixed since 
then; for instance, all consistency levels '''are''' supported, for both reads 
and writes).
  


[jira] [Updated] (CASSANDRA-2653) index scan errors out when zero columns are requested

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2653:


Attachment: 2653_v2.patch

I actually agree with taking the impact, especially given that there are 
actually very few cases where it will make an actual difference anyway.

Attaching a patch (2653_v2, based on 0.7) that implements the idea and adds back 
the sanity check.

 index scan errors out when zero columns are requested
 -

 Key: CASSANDRA-2653
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2653
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.7.6, 0.8.0 beta 2
Reporter: Jonathan Ellis
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.7.7, 0.8.1

 Attachments: 
 0001-Handle-data-get-returning-null-in-secondary-indexes.patch, 
 0001-Handle-null-returns-in-data-index-query-v0.7.patch, 
 0001-Reset-SSTII-in-EchoedRow-constructor.patch, 2653_v2.patch, 
 v1-0001-CASSANDRA-2653-reproduce-regression.txt


 As reported by Tyler Hobbs as an addendum to CASSANDRA-2401,
 {noformat}
 ERROR 16:13:38,864 Fatal exception in thread Thread[ReadStage:16,5,main]
 java.lang.AssertionError: No data found for 
 SliceQueryFilter(start=java.nio.HeapByteBuffer[pos=10 lim=10 cap=30], 
 finish=java.nio.HeapByteBuffer[pos=17 lim=17 cap=30], reversed=false, 
 count=0] in DecoratedKey(81509516161424251288255223397843705139, 
 6b657931):QueryPath(columnFamilyName='cf', superColumnName='null', 
 columnName='null') (original filter 
 SliceQueryFilter(start=java.nio.HeapByteBuffer[pos=10 lim=10 cap=30], 
 finish=java.nio.HeapByteBuffer[pos=17 lim=17 cap=30], reversed=false, 
 count=0]) from expression 'cf.626972746864617465 EQ 1'
   at 
 org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1517)
   at 
 org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
   at 
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2824) assert err on SystemTable.getCurrentLocalNodeId during a cleanup

2011-06-27 Thread Sylvain Lebresne (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Lebresne updated CASSANDRA-2824:


Attachment: 2824_v2.patch

You're right, I skipped the fact that gcBefore==0 is different from 
gc_grace==0. Patch v2 attached.

 assert err on SystemTable.getCurrentLocalNodeId during a cleanup
 

 Key: CASSANDRA-2824
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2824
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Jackson Chung
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2824.patch, 2824_v2.patch


 when running nodetool cleanup the following happened:
 $ ./bin/nodetool cleanup --host localhost
 Exception in thread "main" java.lang.AssertionError
 at 
 org.apache.cassandra.db.SystemTable.getCurrentLocalNodeId(SystemTable.java:383)
 at 
 org.apache.cassandra.utils.NodeId$LocalNodeIdHistory.init(NodeId.java:179)
 at org.apache.cassandra.utils.NodeId.clinit(NodeId.java:38)
 at org.apache.cassandra.utils.NodeId$OneShotRenewer.init(NodeId.java:159)
 at 
 org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:1317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
 at sun.rmi.transport.Transport$1.run(Transport.java:159)
 at java.security.AccessController.doPrivileged(Native Method)
 at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
 at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662) 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever reassigned CASSANDRA-2807:


Assignee: Mck SembWever

 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.8.0
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor

 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.
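
As a sketch of the requested behaviour (this is not an attached patch; the helper class and the first-address fallback are hypothetical), the input side could reuse the same comma-splitting convention RingCache already applies:
{code}
import org.apache.hadoop.conf.Configuration;

public final class InitialAddresses
{
    // cassandra.thrift.address may hold a comma-separated list, as RingCache
    // already assumes on the output side
    public static String[] split(Configuration conf)
    {
        String[] candidates = conf.get("cassandra.thrift.address").split(",");
        for (int i = 0; i < candidates.length; i++)
            candidates[i] = candidates[i].trim();
        return candidates;
    }

    // naive selection: take the first entry; a real implementation would try
    // each address in turn until a Thrift connection succeeds
    public static String pickInitial(Configuration conf)
    {
        return split(conf)[0];
    }
}
{code}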

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Mck SembWever (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mck SembWever updated CASSANDRA-2807:
-

Attachment: CASSANDRA-2807.patch

 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.8.0
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor
 Attachments: CASSANDRA-2807.patch


 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2824) assert err on SystemTable.getCurrentLocalNodeId during a cleanup

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055712#comment-13055712
 ] 

Jonathan Ellis commented on CASSANDRA-2824:
---

+1 (can you include a link here in the comment when you commit?)

 assert err on SystemTable.getCurrentLocalNodeId during a cleanup
 

 Key: CASSANDRA-2824
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2824
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Jackson Chung
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2824.patch, 2824_v2.patch


 when running nodetool cleanup the following happened:
 $ ./bin/nodetool cleanup --host localhost
 Exception in thread "main" java.lang.AssertionError
 at 
 org.apache.cassandra.db.SystemTable.getCurrentLocalNodeId(SystemTable.java:383)
 at 
 org.apache.cassandra.utils.NodeId$LocalNodeIdHistory.init(NodeId.java:179)
 at org.apache.cassandra.utils.NodeId.clinit(NodeId.java:38)
 at org.apache.cassandra.utils.NodeId$OneShotRenewer.init(NodeId.java:159)
 at 
 org.apache.cassandra.service.StorageService.forceTableCleanup(StorageService.java:1317)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
 at 
 com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:27)
 at 
 com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
 at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
 at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
 at 
 com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:836)
 at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1427)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl.java:72)
 at 
 javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1265)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1360)
 at 
 javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:788)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:305)
 at sun.rmi.transport.Transport$1.run(Transport.java:159)
 at java.security.AccessController.doPrivileged(Native Method)
 at sun.rmi.transport.Transport.serviceCall(Transport.java:155)
 at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:790)
 at 
 sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:649)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:662) 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2773) Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-27 Thread Jim Ancona (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jim Ancona updated CASSANDRA-2773:
--

Attachment: 
v1-0002-CASSANDRA-2773-Add-unit-tests-to-verfy-fix-cherry-pick.txt

v1-0001-allow-deleting-a-rowand-updating-indexed-columns-init-.txt

 Index manager cannot support deleting and inserting into a row in the same 
 mutation
 -

 Key: CASSANDRA-2773
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2773
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Boris Yen
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 0.8.2

 Attachments: 2773-v2.txt, 2773.txt, cassandra.log, 
 v1-0001-allow-deleting-a-rowand-updating-indexed-columns-init-.txt, 
 v1-0002-CASSANDRA-2773-Add-unit-tests-to-verfy-fix-cherry-pick.txt


 I use hector 0.8.0-1 and cassandra 0.8.
 1. create mutator by using hector api, 
 2. Insert a few columns into the mutator for key key1, cf standard. 
 3. add a deletion to the mutator to delete the record of key1, cf 
 standard.
 4. repeat 2 and 3
 5. execute the mutator.
 The result: the connection seems to be held by the server forever; it never 
 returns. When I tried to restart Cassandra I saw an UnsupportedOperationException: 
 Index manager cannot support deleting and inserting into a row in the same 
 mutation. And Cassandra is dead forever, unless I delete the commitlog. 
 I would expect to get an exception when I execute the mutator, not after I 
 restart Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2773) Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-27 Thread Jim Ancona (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055721#comment-13055721
 ] 

Jim Ancona commented on CASSANDRA-2773:
---

I applied the 0.8 patch and added a couple of tests to ColumnFamilyStoreTest. 
The tests trigger the UnsupportedOperationException in 0.7.6 and return the 
correct values with the patch applied. Would you like me to test the same 
scenario against an actual patched server?
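
For context, the scenario being tested is roughly the following. This is only a sketch, not the attached v1-0002 test patch; the keyspace, column family and column names are placeholders and the 0.7/0.8-era RowMutation/QueryPath signatures are assumptions.
{code}
import org.apache.cassandra.db.RowMutation;
import org.apache.cassandra.db.filter.QueryPath;
import org.apache.cassandra.utils.ByteBufferUtil;

public class SameMutationDeleteInsertSketch
{
    public static void main(String[] args) throws Exception
    {
        // delete a row and update an indexed column of that row in one mutation;
        // before the fix, applying/replaying this threw the
        // "Index manager cannot support..." UnsupportedOperationException
        RowMutation rm = new RowMutation("Keyspace1", ByteBufferUtil.bytes("k1"));
        rm.delete(new QueryPath("Indexed1"), 1);
        rm.add(new QueryPath("Indexed1", null, ByteBufferUtil.bytes("birthdate")),
               ByteBufferUtil.bytes(100L), 2);
        rm.apply();
    }
}
{code}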

 Index manager cannot support deleting and inserting into a row in the same 
 mutation
 -

 Key: CASSANDRA-2773
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2773
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Boris Yen
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 0.8.2

 Attachments: 2773-v2.txt, 2773.txt, cassandra.log, 
 v1-0001-allow-deleting-a-rowand-updating-indexed-columns-init-.txt, 
 v1-0002-CASSANDRA-2773-Add-unit-tests-to-verfy-fix-cherry-pick.txt


 I use hector 0.8.0-1 and cassandra 0.8.
 1. create mutator by using hector api, 
 2. Insert a few columns into the mutator for key key1, cf standard. 
 3. add a deletion to the mutator to delete the record of key1, cf 
 standard.
 4. repeat 2 and 3
 5. execute the mutator.
 The result: the connection seems to be held by the server forever; it never 
 returns. When I tried to restart Cassandra I saw an UnsupportedOperationException: 
 Index manager cannot support deleting and inserting into a row in the same 
 mutation. And Cassandra is dead forever, unless I delete the commitlog. 
 I would expect to get an exception when I execute the mutator, not after I 
 restart Cassandra.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055726#comment-13055726
 ] 

David Allsopp commented on CASSANDRA-2383:
--

In cassandra.bat, change:

 -Dlog4j.configuration=log4j-server.properties^

to:

 -Dlog4j.configuration=file:///%CASSANDRA_HOME%/conf/log4j-server.properties^

I'm hopeful there's a less ugly way, but this seems to work.

 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when the Cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055734#comment-13055734
 ] 

David Allsopp commented on CASSANDRA-2383:
--

OK, simpler solution is:

 -Dlog4j.configuration=file:conf/log4j-server.properties^

 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when the Cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055734#comment-13055734
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 8:39 PM:
---

OK, simpler solution is:

 -Dlog4j.configuration=file:conf/log4j-server.properties^

But note that this only works if the current directory is CASSANDRA_HOME, so 
double-clicking on the batch file won't work, whereas the previous solution 
will work from the 'bin' directory, so double-clicking is OK.

  was (Author: dallsopp):
OK, simpler solution is:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
  
 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when the Cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Brandon Williams (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-2818:


Attachment: 2818-v3.txt

v2 has two problems:

* It shuts the connection down slightly too aggressively, causing an exception 
on the remote side before setVersion gets called.

* It stores the remote's version even when it is greater, causing the lower 
version node to always report itself as the newer version to the newer node.

v3 addresses the first problem by sleeping for half a second before closing, and 
addresses the second by only calling setVersion if the remote side is 
compatible; otherwise it calls addSavedEndpoint before disconnecting so that it 
will reconnect.

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Cassandra Wiki] Update of HowToContribute by Joe Stein

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The HowToContribute page has been changed by Joe Stein:
http://wiki.apache.org/cassandra/HowToContribute?action=diff&rev1=37&rev2=38

Comment:
failed nosetests without the cql driver being installed first

  Setting up and running system tests:
  
  === Running the functional tests for Thrift RPC ===
+  1. Install CQL: `svn checkout 
https://svn.apache.org/repos/asf/cassandra/drivers; cd drivers/py; python 
setup.py build; sudo python setup.py install`.
   1. Install the 
[[http://somethingaboutorange.com/mrl/projects/nose/0.11.1/|nose]] test runner 
(`aptitude install python-nose`, `easy_install nose`, etc).
   1. Install the Thrift compiler (see InstallThrift) and Python libraries (`cd 
thrift/lib/py && python setup.py install`).
   1. Generate Cassandra's Python code using `ant gen-thrift-py`.
+  1. Build the source `ant clean build`.
   1. Run `nosetests test/system/` from the top-level source directory.
  
  If you need to modify the system tests, you probably only need to care about 
test/system/test_thrift_server.py.  (test/system/__init__.py takes care of 
spawning new cassandra instances for each test and cleaning up afterwards so 
they are isolated.)


[jira] [Commented] (CASSANDRA-2773) Index manager cannot support deleting and inserting into a row in the same mutation

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055771#comment-13055771
 ] 

Jonathan Ellis commented on CASSANDRA-2773:
---

Thanks for the test case, Jim.

If by "the same scenario" you mean the workload that left your commitlog 
throwing exceptions, then yes please.

 Index manager cannot support deleting and inserting into a row in the same 
 mutation
 -

 Key: CASSANDRA-2773
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2773
 Project: Cassandra
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Boris Yen
Assignee: Jonathan Ellis
Priority: Critical
 Fix For: 0.8.2

 Attachments: 2773-v2.txt, 2773.txt, cassandra.log, 
 v1-0001-allow-deleting-a-rowand-updating-indexed-columns-init-.txt, 
 v1-0002-CASSANDRA-2773-Add-unit-tests-to-verfy-fix-cherry-pick.txt


 I use hector 0.8.0-1 and cassandra 0.8.
 1. create mutator by using hector api, 
 2. Insert a few columns into the mutator for key key1, cf standard. 
 3. add a deletion to the mutator to delete the record of key1, cf 
 standard.
 4. repeat 2 and 3
 5. execute the mutator.
 the result: the connection seems to be held by the sever forever, it never 
 returns. when I tried to restart the cassandra I saw unsupportedexception : 
 Index manager cannot support deleting and inserting into a row in the same 
 mutation. and the cassandra is dead forever, unless I delete the commitlog. 
 I would expect to get an exception when I execute the mutator, not after I 
 restart the cassandra.
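
For readers trying to reproduce this, the steps above translate roughly into the 
following Hector sketch. The Hector 0.8 class and method names are given from 
memory and should be treated as assumptions; the cluster and keyspace names and 
column values are placeholders, and only the column family "standard" and key 
"key1" come from the report.

{code}
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class Cassandra2773Repro
{
    public static void main(String[] args)
    {
        Cluster cluster = HFactory.getOrCreateCluster("test-cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("Keyspace1", cluster);
        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());

        // Steps 2-4: queue inserts and a whole-row deletion for the same key
        // in the same batch against an (indexed) column family.
        mutator.addInsertion("key1", "standard", HFactory.createStringColumn("col1", "val1"));
        mutator.addDeletion("key1", "standard");
        mutator.addInsertion("key1", "standard", HFactory.createStringColumn("col2", "val2"));

        // Step 5: executing the batch is what triggered the reported hang and
        // the UnsupportedOperationException on restart.
        mutator.execute();
    }
}
{code}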

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-1125) Filter out ColumnFamily rows that aren't part of the query

2011-06-27 Thread Mck SembWever (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055503#comment-13055503
 ] 

Mck SembWever edited comment on CASSANDRA-1125 at 6/27/11 9:31 PM:
---

can this go into 0.8.1 ?
( and can we split this issue into two: 1) for KeyRange and 2) for IndexClause )


  was (Author: michaelsembwever):
can this go into 0.8.1 ?
  
 Filter out ColumnFamily rows that aren't part of the query
 --

 Key: CASSANDRA-1125
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1125
 Project: Cassandra
  Issue Type: New Feature
  Components: Hadoop
Reporter: Jeremy Hanna
Assignee: Mck SembWever
Priority: Minor
 Fix For: 1.0

 Attachments: CASSANDRA-1125.patch


 Currently, when running a MapReduce job against data in a Cassandra data 
 store, it reads through all the data for a particular ColumnFamily.  This 
 could be optimized to only read through those rows that have to do with the 
 query.
 It's a small change but wanted to put it in Jira so that it didn't fall 
 through the cracks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055775#comment-13055775
 ] 

Jonathan Ellis commented on CASSANDRA-2818:
---

bq. It shuts the connection down slightly too aggressively, causing an 
exception on the remote side before setVersion gets called

I can see that the initiating side could get pissed that the target closes the 
socket uncleanly -- what I don't get is how a sleep could make a difference.  
Is it on the reconnect?  In which case the sleep is going to be fragile with a 
bigger cluster, since we depend on gossip to spread the version info.

Do you have a sample stacktrace?

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




svn commit: r1140333 - in /cassandra/branches/cassandra-0.8: CHANGES.txt src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java

2011-06-27 Thread jbellis
Author: jbellis
Date: Mon Jun 27 21:40:36 2011
New Revision: 1140333

URL: http://svn.apache.org/viewvc?rev=1140333view=rev
Log:
Add support for multiple (comma-delimited) coordinator addresses to CFIF
patch by Mck SembWever; reviewed by jbellis for CASSANDRA-2807

Modified:
cassandra/branches/cassandra-0.8/CHANGES.txt

cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java

Modified: cassandra/branches/cassandra-0.8/CHANGES.txt
URL: 
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/CHANGES.txt?rev=1140333r1=1140332r2=1140333view=diff
==
--- cassandra/branches/cassandra-0.8/CHANGES.txt (original)
+++ cassandra/branches/cassandra-0.8/CHANGES.txt Mon Jun 27 21:40:36 2011
@@ -5,6 +5,8 @@
  * Expose number of threads blocked on submitting memtable to flush
(CASSANDRA-2817)
  * add ability to return endpoints to nodetool (CASSANDRA-2776)
+ * Add support for multiple (comma-delimited) coordinator addresses
+   to ColumnFamilyInputFormat (CASSANDRA-2807)
 
 
 0.8.1

Modified: 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java
URL: 
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java?rev=1140333r1=1140332r2=1140333view=diff
==
--- 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java
 (original)
+++ 
cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java
 Mon Jun 27 21:40:36 2011
@@ -214,7 +214,30 @@ public class ColumnFamilyInputFormat ext
 
     private List<TokenRange> getRangeMap(Configuration conf) throws IOException
     {
-        Cassandra.Client client = createConnection(ConfigHelper.getInitialAddress(conf), ConfigHelper.getRpcPort(conf), true);
+        String[] addresses = ConfigHelper.getInitialAddress(conf).split(",");
+        Cassandra.Client client = null;
+        List<IOException> exceptions = new ArrayList<IOException>();
+        for (String address : addresses)
+        {
+            try
+            {
+                client = createConnection(address, ConfigHelper.getRpcPort(conf), true);
+                break;
+            }
+            catch (IOException ioe)
+            {
+                exceptions.add(ioe);
+            }
+        }
+        if (client == null)
+        {
+            logger.error("failed to connect to any initial addresses");
+            for (IOException ioe : exceptions)
+            {
+                logger.error("", ioe);
+            }
+            throw exceptions.get(exceptions.size() - 1);
+        }
 
         List<TokenRange> map;
         try

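For context, a minimal job-configuration sketch using the new comma-delimited 
form. The ConfigHelper setters named here are the 0.8-era helpers and the 
addresses are placeholders; treat the exact method names as assumptions if your 
version differs.

{code}
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;

public class MultiCoordinatorConfig
{
    public static Configuration build()
    {
        Configuration conf = new Configuration();
        // With this commit, getRangeMap() splits the initial address on commas
        // and tries each coordinator in turn until one connection succeeds.
        ConfigHelper.setInitialAddress(conf, "10.0.0.1,10.0.0.2,10.0.0.3");
        ConfigHelper.setRpcPort(conf, "9160");
        return conf;
    }
}
{code}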



[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055778#comment-13055778
 ] 

Jonathan Ellis commented on CASSANDRA-2383:
---

log4j should already find it because of

set CLASSPATH=%CASSANDRA_HOME%\conf
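
A quick way to test that claim is a stand-alone resource lookup (an illustrative 
snippet, not Cassandra code; the resource name matches the properties file 
discussed here):

{code}
public class ResourceLookupCheck
{
    public static void main(String[] args)
    {
        // If conf\ is on the classpath as cassandra.bat sets it, this should
        // resolve to a file: URL even when the working directory is bin\.
        java.net.URL url = ResourceLookupCheck.class.getClassLoader()
                                                    .getResource("log4j-server.properties");
        System.out.println(url == null ? "not found on classpath" : url.toString());
    }
}
{code}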


 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2807:
--

Affects Version/s: (was: 0.8.0)
   0.6
Fix Version/s: 0.8.1

 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.6
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor
 Fix For: 0.8.1

 Attachments: CASSANDRA-2807.patch


 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055780#comment-13055780
 ] 

Jeremy Hanna commented on CASSANDRA-2807:
-

is this an easy thing to commit to 0.7-branch as well?

 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.6
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor
 Fix For: 0.8.1

 Attachments: CASSANDRA-2807.patch


 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055782#comment-13055782
 ] 

David Allsopp commented on CASSANDRA-2383:
--

Another solution that seems to work regardless of the working directory is to 
leave the original log4j.configuration line but remove the line:

-Dlog4j.defaultInitOverride=true^

This gets the classloader to find the config file as a resource, rather than 
supplying a file reference directly.

However, I'm unsure why the defaultInitOverride was there in the first place...

 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2653) index scan errors out when zero columns are requested

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055786#comment-13055786
 ] 

Jonathan Ellis commented on CASSANDRA-2653:
---

doesn't this assert still have the "the query to the index and the data is not 
atomic" problem?

 index scan errors out when zero columns are requested
 -

 Key: CASSANDRA-2653
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2653
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.7.6, 0.8.0 beta 2
Reporter: Jonathan Ellis
Assignee: Sylvain Lebresne
Priority: Minor
 Fix For: 0.7.7, 0.8.1

 Attachments: 
 0001-Handle-data-get-returning-null-in-secondary-indexes.patch, 
 0001-Handle-null-returns-in-data-index-query-v0.7.patch, 
 0001-Reset-SSTII-in-EchoedRow-constructor.patch, 2653_v2.patch, 
 v1-0001-CASSANDRA-2653-reproduce-regression.txt


 As reported by Tyler Hobbs as an addendum to CASSANDRA-2401,
 {noformat}
 ERROR 16:13:38,864 Fatal exception in thread Thread[ReadStage:16,5,main]
 java.lang.AssertionError: No data found for 
 SliceQueryFilter(start=java.nio.HeapByteBuffer[pos=10 lim=10 cap=30], 
 finish=java.nio.HeapByteBuffer[pos=17 lim=17 cap=30], reversed=false, 
 count=0] in DecoratedKey(81509516161424251288255223397843705139, 
 6b657931):QueryPath(columnFamilyName='cf', superColumnName='null', 
 columnName='null') (original filter 
 SliceQueryFilter(start=java.nio.HeapByteBuffer[pos=10 lim=10 cap=30], 
 finish=java.nio.HeapByteBuffer[pos=17 lim=17 cap=30], reversed=false, 
 count=0]) from expression 'cf.626972746864617465 EQ 1'
   at 
 org.apache.cassandra.db.ColumnFamilyStore.scan(ColumnFamilyStore.java:1517)
   at 
 org.apache.cassandra.service.IndexScanVerbHandler.doVerb(IndexScanVerbHandler.java:42)
   at 
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:72)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 {noformat}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055789#comment-13055789
 ] 

Jonathan Ellis commented on CASSANDRA-2807:
---

It's time to be rigorous about keeping 0.7 stable -- if you want new features 
that are probably safe, like this one, you should be using 0.8.x.

 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.6
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor
 Fix For: 0.8.1

 Attachments: CASSANDRA-2807.patch


 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp commented on CASSANDRA-2383:
--

@Jonathan - yes, I thought so too, but it doesn't.  With  -Dlog4j.debug=true 
and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does no
t exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.


 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 10:00 PM:


@Jonathan - yes, I thought so too, but it doesn't.  With  -Dlog4j.debug=true 
and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.


  was (Author: dallsopp):
@Jonathan - yes, I thought so too, but it doesn't.  With  
-Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

  
 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 

[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 9:59 PM:
---

@Jonathan - yes, I thought so too, but it doesn't.  With  -Dlog4j.debug=true 
and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.


  was (Author: dallsopp):
@Jonathan - yes, I thought so too, but it doesn't.  With  
-Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Documents/Work/Coding/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

  
 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee

[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 10:00 PM:


@Jonathan - yes, I thought so too, but it doesn't.  With  -Dlog4j.debug=true 
and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.


  was (Author: dallsopp):
@Jonathan - yes, I thought so too, but it doesn't.  With  
-Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

  
 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7



[jira] [Commented] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055796#comment-13055796
 ] 

Brandon Williams commented on CASSANDRA-2818:
-

This repeats infinitely:

{noformat}
TRACE 22:02:38,187 cassandra-2/10.179.64.227 sending GOSSIP_DIGEST_SYN to 
9@/10.179.65.102
DEBUG 22:02:38,188 error writing to /10.179.65.102
java.net.SocketException: Connection reset
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:96)
at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
at java.io.DataOutputStream.flush(DataOutputStream.java:106)
at 
org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:114)
at 
org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:90)
{noformat}

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Tomas Salfischberger (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055798#comment-13055798
 ] 

Tomas Salfischberger commented on CASSANDRA-2830:
-

I did a really crude trial of adding the SUM keyword to the Cql.g definition 
and handling it in the same way as COUNT is done. It works, I can SUM my counter 
columns using CQL, but this is of course not a great way to implement it.

Any pointers for the right way to implement this? How do we want to define 
generic aggregate functions in the grammar and what would be the best way to 
handle them?

 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055797#comment-13055797
 ] 

Jonathan Ellis commented on CASSANDRA-2383:
---

Weird.  Could you dig into the code in AbstractCassandraDaemon a little?  
Here's what's trying to do load by classpath:

{code}
// load from the classpath.
configLocation = AbstractCassandraDaemon.class.getClassLoader().getResource(config);
if (configLocation == null)
    throw new RuntimeException("Couldn't figure out log4j configuration.");
{code}


 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 10:04 PM:


@Jonathan - yes, I thought so too, but it doesn't. The default init process for 
log4j is described at http://logging.apache.org/log4j/1.2/manual.html, but it 
doesn't really explain what happens if defaultInitOverride is set!

With  -Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.


  was (Author: dallsopp):
@Jonathan - yes, I thought so too, but it doesn't.  With  
-Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

  
 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 

[jira] [Issue Comment Edited] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Tomas Salfischberger (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055798#comment-13055798
 ] 

Tomas Salfischberger edited comment on CASSANDRA-2830 at 6/27/11 10:06 PM:
---

I did a really crude trial of adding the SUM keyword to the Cql.g definition 
and handling it in the same way as COUNT is done. It works, I can SUM my counter 
columns using CQL, but this is of course not a great way to implement it.

Any pointers for the right way to implement this? How do we want to define 
generic aggregate functions in the grammar and what would be the best way to 
handle them?

  was (Author: t0mas):
I did a really crude trail of adding the SUM keyword to the Cql.g 
definition and handle it in the same way as COUNT is done. It works, I can SUM 
my counter columns using CQL, but this is of course not a great way to 
implement it.

Any pointers for the right way to implement this? How do we want to define 
generic aggregate functions in the grammar and what would be the best way to 
handle them?
  
 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055800#comment-13055800
 ] 

David Allsopp commented on CASSANDRA-2383:
--

http://www.vipan.com/htdocs/log4jhelp.html has the following:

log4j.configuration=app_config.properties: 

First call to Category.getRoot() or Category.getInstance(...) method makes 
Log4j go through an initialization process. (You can watch that happening by 
setting log4j.debug=true.) 

During this process, Log4j looks in the application's classpath for a 
log4j.properties file *or the properties file you specify* via this 
property key. 

However, you need to set this as a system property, for example by running your 
program with java -Dlog4j.configuration=app_config.properties. This is 
because, if you set it in the configuration file, it is too late. Log4j would 
have already started to read the log4j.properties file by default, if available!
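
A small stand-alone illustration of that ordering constraint (standard log4j 1.2 
API; the properties path is just an example, and in practice the property is 
passed on the java command line as the quote says):

{code}
import org.apache.log4j.Logger;

public class Log4jInitOrder
{
    public static void main(String[] args)
    {
        // Must be set before the first Logger is obtained anywhere in the JVM;
        // setting it later, or inside a config file, is too late.
        System.setProperty("log4j.configuration", "file:conf/log4j-server.properties");

        Logger log = Logger.getLogger(Log4jInitOrder.class); // triggers log4j initialization
        log.info("configured from " + System.getProperty("log4j.configuration"));
    }
}
{code}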

 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: Benjamin Coverston
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Issue Comment Edited] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055792#comment-13055792
 ] 

David Allsopp edited comment on CASSANDRA-2383 at 6/27/11 10:13 PM:


@Jonathan - yes, I thought so too, but it doesn't. The default init process for 
log4j is described at http://logging.apache.org/log4j/1.2/manual.html, but it 
doesn't really explain what happens if defaultInitOverride is set!

With  -Dlog4j.debug=true and the original batch file, i.e.

{quote}
 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^
{quote}
I see:
{quote}
Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
{quote}
If I remove the defaultInitOverride, I get:
{quote}
Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]
{quote}
Finally, with:
{quote}
 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^
{quote}
I get this, if current directory is CASSANDRA_HOME:
{quote}
Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]
{quote}
But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):
{quote}
Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
{quote}

  was (Author: dallsopp):
@Jonathan - yes, I thought so too, but it doesn't. The default init process 
for log4j is described at http://logging.apache.org/log4j/1.2/manual.html, but 
it doesn't really explain what happens if defaultInitOverride is set!

With  -Dlog4j.debug=true and the original batch file, i.e.

 -Dlog4j.configuration=log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I see:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

If I remove the defaultInitOverride, I get:

Starting Cassandra Server
log4j: 
[/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 does not exist.
log4j: Trying to find [log4j-server.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@1a45a877.
log4j: Using URL 
[file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties]
 for automatic log4j configuration.
log4j: Reading configuration from URL 
file:/C:/Users/David/Key%20Value/apache-cassandra-0.7.6-2/conf/log4j-server.properties
[etc...]

Finally, with:

 -Dlog4j.configuration=file:conf/log4j-server.properties^
 -Dlog4j.defaultInitOverride=true^

I get this, if current directory is CASSANDRA_HOME:

Starting Cassandra Server
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
[etc...]

But this, if current directory is CASSANDRA_HOME/bin (e.g. if double-clicking 
the batch file):

Starting Cassandra Server
log4j: [conf/log4j-server.properties] does not exist.
log4j: Default initialization of overridden by 
log4j.defaultInitOverrideproperty.
log4j:WARN No appenders could be found for logger 
(org.apache.cassandra.service.AbstractCassandraDaemon).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.

  
 cassandra.bat does not handle folder names with space characters on windows
 

[jira] [Commented] (CASSANDRA-2830) Allow summing of counter columns in CQL

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13055802#comment-13055802
 ] 

Jonathan Ellis commented on CASSANDRA-2830:
---

Aggregate functions really need to wait for CASSANDRA-2474 and its 
treat-a-row-as-a-table feature.

 Allow summing of counter columns in CQL
 ---

 Key: CASSANDRA-2830
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2830
 Project: Cassandra
  Issue Type: New Feature
  Components: API
Reporter: Tomas Salfischberger
Priority: Minor
  Labels: CQL

 CQL could be extended with a method to calculate the sum of a set of counter 
 columns. This avoids transferring a long list of counter columns to be summed 
 by the client, while the server could calculate the total and instead only 
 transfer that result. My proposal for the syntax (based on the COUNT() 
 suggestion in the comments of CASSANDRA-1704):
 {code}SELECT SUM(columnFrom..columnTo) FROM CF WHERE ...{code}
 The simplest approach would be to only allow summing of counters under the 
 same key, thus a query with a WHERE part that specifies multiple keys would 
 return 1 result per key. This avoids summing values from different nodes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (CASSANDRA-2383) cassandra.bat does not handle folder names with space characters on windows

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis reassigned CASSANDRA-2383:
-

Assignee: T Jake Luciani  (was: Benjamin Coverston)

 cassandra.bat does not handle folder names with space characters on windows
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2383) log4j unable to load properties file from classpath

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2383:
--

Summary: log4j unable to load properties file from classpath  (was: 
cassandra.bat does not handle folder names with space characters on windows)

 log4j unable to load properties file from classpath
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2383) log4j unable to load properties file from classpath

2011-06-27 Thread David Allsopp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055807#comment-13055807
 ] 

David Allsopp commented on CASSANDRA-2383:
--

@Jonathan - will try to take a look soon. The getResource() behaviour might 
be affected by the classes being packaged inside a jar file.
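
A quick way to see what getResource() actually resolves to in a given 
deployment (a plain diagnostic sketch, not part of any patch; the resource 
name is just the properties file shipped in conf/):
{code}
import java.net.URL;

public class Log4jConfigProbe
{
    public static void main(String[] args)
    {
        // If log4j-server.properties is on the classpath as a plain file this
        // prints a file: URL; if it is packaged inside a jar it prints a
        // jar:file:...!/... URL instead, which some loading code handles
        // differently.
        URL url = Log4jConfigProbe.class.getClassLoader()
                                        .getResource("log4j-server.properties");
        System.out.println("log4j config resolved to: " + url);
    }
}
{code}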

 log4j unable to load properties file from classpath
 ---

 Key: CASSANDRA-2383
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2383
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 0.7.4
 Environment: OS : windows
 java : 1.6.0.23
Reporter: david lee
Assignee: T Jake Luciani
Priority: Minor
 Fix For: 0.7.7


 when cassandra home folder is placed inside a folder which has space 
 characters in its name,
 log4j settings are not properly loaded and warning messages are shown.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2807) ColumnFamilyInputFormat configuration should support multiple initial addresses

2011-06-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055881#comment-13055881
 ] 

Hudson commented on CASSANDRA-2807:
---

Integrated in Cassandra-0.8 #194 (See 
[https://builds.apache.org/job/Cassandra-0.8/194/])
Add support for multiple (comma-delimited) coordinator addresses to CFIF
patch by Mck SembWever; reviewed by jbellis for CASSANDRA-2807

jbellis : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140333
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* 
/cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java


 ColumnFamilyInputFormat configuration should support multiple initial 
 addresses
 ---

 Key: CASSANDRA-2807
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2807
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop
Affects Versions: 0.6
Reporter: Greg Katz
Assignee: Mck SembWever
Priority: Minor
 Fix For: 0.8.1

 Attachments: CASSANDRA-2807.patch


 The {{ColumnFamilyInputFormat}} class only allows a single initial node to be 
 specified through the cassandra.thrift.address configuration property. The 
 configuration should support a list of nodes in order to account for the 
 possibility that the initial node becomes unavailable.
 By contrast, the {{RingCache}} class used by the {{ColumnFamilyRecordWriter}} 
 reads the exact same {{cassandra.thrift.address}} property but splits its 
 value on commas to allow multiple initial nodes to be specified.
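
For example, a job could supply a comma-delimited list directly through the 
Hadoop Configuration (the hostnames below are placeholders; only the 
cassandra.thrift.address property name comes from this ticket):
{code}
import org.apache.hadoop.conf.Configuration;

public class CfifAddressExample
{
    public static void main(String[] args)
    {
        Configuration conf = new Configuration();
        // With this change CFIF can fall back to the next address if the
        // first initial node is unavailable. Hostnames are placeholders.
        conf.set("cassandra.thrift.address", "cass-node1,cass-node2,cass-node3");
        System.out.println(conf.get("cassandra.thrift.address"));
    }
}
{code}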

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055885#comment-13055885
 ] 

Jonathan Ellis commented on CASSANDRA-2818:
---

v4 also changes the current-version to 3, so we don't create a version 
exhaustion problem for ourselves (see comment in MS).

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 
 2818-v4.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2818:
--

Attachment: 2818-v4.txt

That makes sense for v2, yeah.

I realized that we don't actually need to reconnect to send old-version 
messages -- version is per-Message, the connection itself is basically just a 
queue around a socket.

v4 attached that doesn't drop (non-streaming) connections at all.  (This is 
part of the "how did it possibly work on 0.7?" answer, I think.)
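
Purely as an illustration of the per-message-version idea described above 
(hypothetical names, not the actual MessagingService code):
{code}
import java.util.ArrayDeque;
import java.util.Queue;

// Each outgoing message carries the protocol version it was serialized with;
// the connection stays a plain queue around a socket and never needs to drop.
class VersionedMessage
{
    final int version;
    final byte[] payload;

    VersionedMessage(int version, byte[] payload)
    {
        this.version = version;
        this.payload = payload;
    }
}

class OutboundQueue
{
    private final Queue<VersionedMessage> queue = new ArrayDeque<VersionedMessage>();

    // The sender chooses the serialization version per message, based on what
    // the target node is known to speak.
    synchronized void enqueue(byte[] payload, int targetVersion)
    {
        queue.add(new VersionedMessage(targetVersion, payload));
    }
}
{code}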

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 
 2818-v4.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055887#comment-13055887
 ] 

Jonathan Ellis commented on CASSANDRA-2521:
---

+1 from me in principle, suggest waiting for Terje's test before committing tho 
:)

 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055888#comment-13055888
 ] 

Jonathan Ellis commented on CASSANDRA-2521:
---

(suggest adding a comment for the markCompacted acquire/release code on commit.)
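
For readers following along, a rough sketch of what acquire/release-style 
reference counting on an sstable can look like (hypothetical names and 
structure, not the attached patch itself):
{code}
import java.util.concurrent.atomic.AtomicInteger;

class RefCountedSSTable
{
    // Starts at 1: the reference held by the live data view.
    private final AtomicInteger references = new AtomicInteger(1);
    private volatile boolean compacted = false;

    // Readers take a reference before using the sstable; fails if it is gone.
    boolean acquire()
    {
        while (true)
        {
            int n = references.get();
            if (n <= 0)
                return false;
            if (references.compareAndSet(n, n + 1))
                return true;
        }
    }

    void release()
    {
        if (references.decrementAndGet() == 0 && compacted)
            deleteFiles();
    }

    // Compaction marks the sstable obsolete and drops the live-view reference;
    // the files are only removed once the last reader has released.
    void markCompacted()
    {
        compacted = true;
        release();
    }

    private void deleteFiles()
    {
        // unlink data/index/filter components here
    }
}
{code}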

 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2832) reduce variance in HH impact between wide and narrow rows

2011-06-27 Thread Jonathan Ellis (JIRA)
reduce variance in HH impact between wide and narrow rows
-

 Key: CASSANDRA-2832
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2832
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.7.7, 0.8.2
 Attachments: 2832.txt

default page_size of 1 is huge, and makes it impossible to set a 
hinted_handoff_throttle_delay_in_ms that works well for both wide and narrow 
rows.

At the same time you don't want to make it TOO small, because that will hurt 
performance on the source node (seeking to the hinted row repeatedly).
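
As a rough illustration of why the two settings interact (hypothetical numbers, 
and assuming the throttle sleep is applied once per page, as the description 
implies):
{code}
public class HintThrottleEstimate
{
    public static void main(String[] args)
    {
        // All figures below are made up for illustration only.
        long hintedColumns = 200000;   // hint columns queued for one endpoint
        int pageSize = 1024;           // columns fetched per page
        long throttleDelayMs = 50;     // hinted_handoff_throttle_delay_in_ms (assumed)

        long pages = (hintedColumns + pageSize - 1) / pageSize;
        System.out.printf("~%d pages, ~%d ms of throttle sleep in total%n",
                          pages, pages * throttleDelayMs);
    }
}
{code}
With a much larger page size the same delay value produces far fewer sleeps, 
which is presumably why one delay value struggles to suit both wide and narrow 
rows.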

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-2832) reduce variance in HH impact between wide and narrow rows

2011-06-27 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-2832:
--

Attachment: 2832.txt

patch to set max page_size to 128.

 reduce variance in HH impact between wide and narrow rows
 -

 Key: CASSANDRA-2832
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2832
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.7.7, 0.8.2

 Attachments: 2832.txt


 default page_size of 1 is huge, and makes it impossible to set a 
 hinted_handoff_throttle_delay_in_ms that works well for both wide rows and 
 narrow.
 at the same time you don't want to make it TOO small because that will hurt 
 performance on the source node (seeking to the hinted row repeatedly).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2818) 0.8.0 is unable to participate with nodes using a _newer_ protocol version

2011-06-27 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056191#comment-13056191
 ] 

Brandon Williams commented on CASSANDRA-2818:
-

+1

 0.8.0 is unable to participate with nodes using a _newer_ protocol version
 --

 Key: CASSANDRA-2818
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2818
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.8.0
Reporter: Michael Allen
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.2

 Attachments: 2818-disconnect.txt, 2818-v2.txt, 2818-v3.txt, 
 2818-v4.txt, 2818.txt


 When a 0.8.1 node tries to join a 0.8.0 ring, we see an endless supply of 
 these in system.log:
 INFO [Thread-4] 2011-06-23 21:14:04,149 IncomingTcpConnection.java (line 103) 
 Received connection from newer protocol version. Ignorning message.
 and the node never joins the ring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




svn commit: r1140385 - /cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java

2011-06-27 Thread jbellis
Author: jbellis
Date: Tue Jun 28 00:14:09 2011
New Revision: 1140385

URL: http://svn.apache.org/viewvc?rev=1140385&view=rev
Log:
reduce max HH pageSize to 1024
patch by jbellis; reviewed by brandonwilliams for CASSANDRA-2832

Modified:

cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java

Modified: 
cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java
URL: 
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java?rev=1140385&r1=1140384&r2=1140385&view=diff
==
--- 
cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java
 (original)
+++ 
cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java
 Tue Jun 28 00:14:09 2011
@@ -92,7 +92,7 @@ public class HintedHandOffManager implem
     public static final String HINTS_CF = "HintsColumnFamily";
 
     private static final Logger logger_ = LoggerFactory.getLogger(HintedHandOffManager.class);
-    private static final int PAGE_SIZE = 1;
+    private static final int PAGE_SIZE = 1024;
     private static final String SEPARATOR = "-";
     private static final int LARGE_NUMBER = 65536; // 64k nodes ought to be enough for anybody.
 




[jira] [Created] (CASSANDRA-2833) CounterColumn should have an optional binary field so that double can be incremented/decremented along with long

2011-06-27 Thread Joe Stein (JIRA)
CounterColumn should have an optional binary field so that double can be 
incremented/decremented along with long


 Key: CASSANDRA-2833
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2833
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Joe Stein


Currently CounterColumn only has a long value, making it infeasible to track 
increments/decrements of durations or other analytics-style values represented 
as a double.

The change I am proposing to implement, after some discussion/advice on IRC 
about the issues raised, is to add a new optional binary field to CounterColumn 
(Thrift). I was thinking we could call it *operand*.

Under the hood (src/java/org/apache/cassandra/db/CounterColumn.java) I would 
handle things with byte[] helpers that convert between long and double, with a 
case switch on the type of operand being set. We might also need an optional 
enum for the type, so the client can let the server know how it should 
materialize the bytes when it += the value stored.

The clients should continue to work as expected and folks looking to use this 
can just do so.
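
A minimal sketch of the byte[] helpers described above, using 
java.nio.ByteBuffer (class and method names here are hypothetical):
{code}
import java.nio.ByteBuffer;

final class OperandCodec
{
    static byte[] encodeDouble(double value)
    {
        return ByteBuffer.allocate(8).putDouble(value).array();
    }

    static double decodeDouble(byte[] bytes)
    {
        return ByteBuffer.wrap(bytes).getDouble();
    }

    // The "+=" the server would apply when the operand is flagged as a double.
    static byte[] addDouble(byte[] current, byte[] delta)
    {
        return encodeDouble(decodeDouble(current) + decodeDouble(delta));
    }
}
{code}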

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Terje Marthinussen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056239#comment-13056239
 ] 

Terje Marthinussen commented on CASSANDRA-2521:
---

First impression is good.


I merged this with 2816 last night and tested overnight with a dataset that 
normally uses 32-40% disk space when compacted.

Calling repair on a handful of nodes, I would normally have a good chance of 
seeing a node or two with a 90%+ full disk and regular full GCs.

With these patches, disk use seems to peak in the 55-65% area, which is a very 
significant improvement.

Unfortunately, I started repairs on 5 of 12 nodes and one of them has gone 
crazy and filled the disk 100%. :(

I did happen to start repair twice on this node by accident. 
I don't really think that is the real problem, but I am not sure.
I will make a ticket anyway on adding functionality to prevent repair from 
being started twice on the same CF, and I will test more to see if this 
happens again.

However I noticed this in the log:
 INFO [Thread-185] 2011-06-28 05:01:15,390 StorageService.java (line 2083) 
requesting GC to free disk space
I guess we can get rid of that? There should be nothing to gain from calling a 
full GC there anymore.

If anything, maybe we should consider a variation where Cassandra instead 
aborts all repairs and compactions and cleans everything up before trying 
again with smaller amounts of data in each compaction/repair attempt?

Another ticket on that maybe?



 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (CASSANDRA-2834) Avoid repair getting started twice at the same time for the same CF

2011-06-27 Thread Terje Marthinussen (JIRA)
Avoid repair getting started twice at the same time for the same CF
---

 Key: CASSANDRA-2834
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2834
 Project: Cassandra
  Issue Type: Improvement
Reporter: Terje Marthinussen


It seems it may be possible to start repair twice at the same time on the same 
CF.

Not 100% verified, but if this is indeed the case, we may want to prevent it, 
including making nodetool repair abort and return an error if repair is 
attempted on a CF that already has a repair running.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2832) reduce variance in HH impact between wide and narrow rows

2011-06-27 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056241#comment-13056241
 ] 

Hudson commented on CASSANDRA-2832:
---

Integrated in Cassandra-0.7 #513 (See 
[https://builds.apache.org/job/Cassandra-0.7/513/])
reduce max HH pageSize to 1024
patch by jbellis; reviewed by brandonwilliams for CASSANDRA-2832

jbellis : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1140385
Files : 
* 
/cassandra/branches/cassandra-0.7/src/java/org/apache/cassandra/db/HintedHandOffManager.java


 reduce variance in HH impact between wide and narrow rows
 -

 Key: CASSANDRA-2832
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2832
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Jonathan Ellis
Assignee: Jonathan Ellis
Priority: Minor
 Fix For: 0.7.7, 0.8.2

 Attachments: 2832.txt


 default page_size of 1 is huge, and makes it impossible to set a 
 hinted_handoff_throttle_delay_in_ms that works well for both wide rows and 
 narrow.
 at the same time you don't want to make it TOO small because that will hurt 
 performance on the source node (seeking to the hinted row repeatedly).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Terje Marthinussen (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056263#comment-13056263
 ] 

Terje Marthinussen commented on CASSANDRA-2521:
---

Looking a bit more at the files and logs from the node with the full disk.

I suspect things mainly went wrong due to the 2 repairs (if they actually both 
ran):
1. There seem to be some Compacted files around which do not get cleaned up, 
but I guess these may be the result of references acquired but never freed as 
the streaming fills up the disk and fails.
2. There are no less than 53 -tmp- files. A lot of concurrent streams here!
 
I still think it would be a good idea if we could detect that we were heading 
for a full disk and abort everything before things crash.

 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Assigned] (CASSANDRA-2819) Split rpc timeout for read and write ops

2011-06-27 Thread Melvin Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-2819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Melvin Wang reassigned CASSANDRA-2819:
--

Assignee: Melvin Wang

 Split rpc timeout for read and write ops
 

 Key: CASSANDRA-2819
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2819
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: Stu Hood
Assignee: Melvin Wang
 Fix For: 1.0


 Given the vastly different latency characteristics of reads and writes, it 
 makes sense for them to have independent rpc timeouts internally.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-2521) Move away from Phantom References for Compaction/Memtable

2011-06-27 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056268#comment-13056268
 ] 

Jonathan Ellis commented on CASSANDRA-2521:
---

bq. I still think it may be a good idea if we could detect that we were heading 
for full disk and abort everything before things crashed

That's a good feature to have, but out of scope here. :)

CASSANDRA-809 is open as a general "deal better with running out of space" 
ticket, but we can open more specific ones.

 Move away from Phantom References for Compaction/Memtable
 -

 Key: CASSANDRA-2521
 URL: https://issues.apache.org/jira/browse/CASSANDRA-2521
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Chris Goffinet
Assignee: Sylvain Lebresne
 Fix For: 1.0

 Attachments: 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-.patch, 
 0001-Use-reference-counting-to-decide-when-a-sstable-can-v2.patch, 
 0002-Force-unmapping-files-before-deletion-v2.patch, 2521-v3.txt, 2521-v4.txt


 http://wiki.apache.org/cassandra/MemtableSSTable
 Let's move to using reference counting instead of relying on GC to be called 
 in StorageService.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (CASSANDRA-957) convenience workflow for replacing dead node

2011-06-27 Thread Chris Goffinet (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Goffinet updated CASSANDRA-957:
-

Assignee: (was: Chris Goffinet)

 convenience workflow for replacing dead node
 

 Key: CASSANDRA-957
 URL: https://issues.apache.org/jira/browse/CASSANDRA-957
 Project: Cassandra
  Issue Type: Wish
  Components: Core, Tools
Reporter: Jonathan Ellis
 Fix For: 1.0

 Attachments: 
 0001-Support-bringing-back-a-node-to-the-cluster-that-exi.patch, 
 0002-Do-not-include-local-node-when-computing-workMap.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Replacing a dead node with a new one is a common operation, but nodetool 
 removetoken followed by bootstrap is inefficient (re-replicating data first 
 to the remaining nodes, then to the new one) and manually bootstrapping to a 
 token just less than the old one's, followed by nodetool removetoken is 
 slightly painful and prone to manual errors.
 First question: how would you expose this in our tool ecosystem?  It needs to 
 be a startup-time option to the new node, so it can't be nodetool, and 
 messing with the config xml definitely takes the convenience out.  A 
 one-off -DreplaceToken=XXY argument?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-957) convenience workflow for replacing dead node

2011-06-27 Thread Chris Goffinet (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056293#comment-13056293
 ] 

Chris Goffinet commented on CASSANDRA-957:
--

I am going to defer this to anyone else who would like to pick up this ticket. 
I just do not have the spare time to focus on this.

 convenience workflow for replacing dead node
 

 Key: CASSANDRA-957
 URL: https://issues.apache.org/jira/browse/CASSANDRA-957
 Project: Cassandra
  Issue Type: Wish
  Components: Core, Tools
Reporter: Jonathan Ellis
 Fix For: 1.0

 Attachments: 
 0001-Support-bringing-back-a-node-to-the-cluster-that-exi.patch, 
 0002-Do-not-include-local-node-when-computing-workMap.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 Replacing a dead node with a new one is a common operation, but nodetool 
 removetoken followed by bootstrap is inefficient (re-replicating data first 
 to the remaining nodes, then to the new one) and manually bootstrapping to a 
 token just less than the old one's, followed by nodetool removetoken is 
 slightly painful and prone to manual errors.
 First question: how would you expose this in our tool ecosystem?  It needs to 
 be a startup-time option to the new node, so it can't be nodetool, and 
 messing with the config xml definitely takes the convenience out.  A 
 one-off -DreplaceToken=XXY argument?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[Cassandra Wiki] Update of UseCases by TimSmith

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The UseCases page has been changed by TimSmith:
http://wiki.apache.org/cassandra/UseCases?action=diff&rev1=14&rev2=15

Comment:
typo

  
[[https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/|Cloudkick]]
 implements time-series down at the second-level with roll-ups.
  
  == An implementation of some DBMS rules written in python using pycassa ==
- We have created a DBMS layer that handles references to other columnfamilys 
(foreign keys), Automatic reverse linking. required fields in columnfamilys and 
datatypes (long and datetime). It wraps the get, get_range, insert, remove 
funtions of pycassas columnfamilys. At this time it is limited to: on delete 
cascade and positive long numbers but this could change if there is enough 
interest. It suits our project.
+ We have created a DBMS layer that handles references to other columnfamilys 
(foreign keys), Automatic reverse linking. required fields in columnfamilys and 
datatypes (long and datetime). It wraps the get, get_range, insert, remove 
functions of pycassas columnfamilys. At this time it is limited to: on delete 
cascade and positive long numbers but this could change if there is enough 
interest. It suits our project.
   
  [[ThomasBoose/dbms implementation]]
  


[Cassandra Wiki] Update of Operations by TimSmith

2011-06-27 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Cassandra Wiki for 
change notification.

The Operations page has been changed by TimSmith:
http://wiki.apache.org/cassandra/Operations?action=diff&rev1=94&rev2=95

Comment:
Fix typos

  See PerformanceTuning
  
  == Schema management ==
- Server clocks should be synchronized with something like ntp.  Otherwise, 
schema changes may be rejected as being obsolete.
+ Server clocks should be synchronized with something like NTP.  Otherwise, 
schema changes may be rejected as being obsolete.
  
  See LiveSchemaUpdates [refers to functionality in 0.7]
  
@@ -23, +23 @@

  === Token selection ===
  Using a strong hash function means !RandomPartitioner keys will, on average, 
be evenly spread across the Token space, but you can still have imbalances if 
your Tokens do not divide up the range evenly, so you should specify 
!InitialToken to your first nodes as `i * (2**127 / N)` for i = 0 .. N-1. In 
Cassandra 0.7, you should specify `initial_token` in `cassandra.yaml`.
  
- With !NetworkTopologyStrategy, you should calculate the tokens the nodes in 
each DC independantly. Tokens still neded to be unique, so you can add 1 to the 
tokens in the 2nd DC, add 2 in the 3rd, and so on.  Thus, for a 4-node cluster 
in 2 datacenters, you would have
+ With !NetworkTopologyStrategy, you should calculate the tokens the nodes in 
each DC independently. Tokens still needed to be unique, so you can add 1 to 
the tokens in the 2nd DC, add 2 in the 3rd, and so on.  Thus, for a 4-node 
cluster in 2 datacenters, you would have
  {{{
  DC1
  node 1 = 0
@@ -91, +91 @@

  Important things to note:
  
   1. You should wait long enough for all the nodes in your cluster to become 
aware of the bootstrapping node via gossip before starting another bootstrap.  
The new node will log Bootstrapping when this is safe, 2 minutes after 
starting.  (90s to make sure it has accurate load information, and 30s waiting 
for other nodes to start sending it inserts happening in its to-be-assumed part 
of the token ring.)
-  1. Relating to point 1, one can only boostrap N nodes at a time with 
automatic token picking, where N is the size of the existing cluster. If you 
need to more than double the size of your cluster, you have to wait for the 
first N nodes to finish until your cluster is size 2N before bootstrapping more 
nodes. So if your current cluster is 5 nodes and you want add 7 nodes, 
bootstrap 5 and let those finish before boostrapping the last two.
+  1. Relating to point 1, one can only bootstrap N nodes at a time with 
automatic token picking, where N is the size of the existing cluster. If you 
need to more than double the size of your cluster, you have to wait for the 
first N nodes to finish until your cluster is size 2N before bootstrapping more 
nodes. So if your current cluster is 5 nodes and you want add 7 nodes, 
bootstrap 5 and let those finish before bootstrapping the last two.
   1. As a safety measure, Cassandra does not automatically remove data from 
nodes that lose part of their Token Range to a newly added node.  Run 
`nodetool cleanup` on the source node(s) (neighboring nodes that shared the 
same subrange) when you are satisfied the new node is up and working. If you do 
not do this the old data will still be counted against the load on that node 
and future bootstrap attempts at choosing a location will be thrown off.
   1. When bootstrapping a new node, existing nodes have to divide the key 
space before beginning replication.  This can take awhile, so be patient.
   1. During bootstrap, a node will drop the Thrift port and will not be 
accessible from `nodetool`.
@@ -148, +148 @@

  Consider how to schedule your repairs. A repair causes additional disk and 
CPU activity on the nodes participating in the repair, and it will typically be 
a good idea to spread repairs out over time so as to minimize the chances of 
repairs running concurrently on many nodes.
  
   Dealing with the consequences of nodetool repair not running within 
GCGraceSeconds 
- If `nodetool repair` has not been run often enough to the pointthat 
GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). 
In addition to data popping up that has been deleted, you may see 
inconsistencies in data return from different nodes that will not self-heal by 
read-repair or further `nodetool repair`. Some further details on this latter 
effect is documented in 
[[https://issues.apache.org/jira/browse/CASSANDRA-1316|CASSANDRA-1316]].
+ If `nodetool repair` has not been run often enough to the point that 
GCGraceSeconds has passed, you risk forgotten deletes (see DistributedDeletes). 
In addition to data popping up that has been deleted, you may see 
inconsistencies in data return from different nodes that will not self-heal by 
read-repair or further `nodetool repair`. Some further details on this latter 
effect is documented in