Re: High BloomFilterFalseRatio
Hi all

Had some time yesterday to dig a little deeper, and maybe this saves someone who makes the same mistake some time, so ...

Trying to reproduce the problem in unit tests with the same data led nowhere, because every single result was almost exactly what the math promised. After incidentally stumbling upon this one: http://sites.google.com/site/murmurhash/murmurhash2flaw and thinking omg, all is lost ... I finally found that everything is just fine.

Turns out that the JMX BloomFilterFalseRatio simply does not show what I expected it to. I thought it would provide a quality measure of how well the bloom filter works in terms of hit rate, which would be (Unnecessary File Lookups / Total Lookups). But it is actually (False Positives / (False + True Positives)), which means it does not count the lookups that were rejected by the filter. So if you only ever ask for rows that do not exist, this ratio will always show 1.0. It is really a measure of how many of your queries ask for non-existing values.

Cheers, Daniel

On Oct 28, 2010, at 1:10 PM, Daniel Doubleday wrote:

Hi Ryan

I took a sample of one sstable (just flushed, not compacted). I compared two samples of sstables: one that is showing fine false positive ratios and the problem one. And yes, both look the same to me. Both have the expected 15 buckets per row and the cardinalities of the bitsets are comparable relative to size.

But I am pretty sure that it is indeed, as suggested, a problem with a skewed query pattern. I stopped the import and started a random read test and things look better. I'll try to reproduce this with a patched cassandra to get more debug info to figure out why this is happening, because I still don't understand it.
Thanks for your time everyone

== Sample of problem CF ==
DATA FILE
file size: 68804626 bytes
rows: 7432
FILTER FILE
file size: 14013 bytes
bloom filter bitset size: 111488
bloom filter bitset cardinality: 54062

== Sample of working CF ==
DATA FILE
file size: 110730565 bytes
rows: 47432
FILTER FILE
file size: 96565 bytes
bloom filter bitset size: 771904
bloom filter bitset cardinality: 354610

On Oct 27, 2010, at 6:41 PM, Ryan King wrote:

On Wed, Oct 27, 2010 at 3:24 AM, Daniel Doubleday daniel.double...@gmx.net wrote:
[...] I see a false positive ratio of 0.28 while in my other CF it is 0.00025. [...] Just wanted to check if this value is to be expected.

This is not expected. How big are the bloom filters on disk?

-ryan
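For anyone checking those samples: both filters really do look healthy. A well-sized bloom filter ends up with roughly half its bits set, and a quick calculation over the numbers above confirms the ~15 buckets per row and a ~0.5 fill factor in both cases. A minimal sketch (plain arithmetic, no Cassandra internals assumed):

```python
# Sanity check of the two sstable samples above: bits per row
# and bitset fill factor (fraction of bits set) for each filter.

def filter_stats(rows, bitset_size, cardinality):
    """Return (bits per row, fraction of bits set)."""
    return bitset_size / rows, cardinality / bitset_size

# Problem CF: 7432 rows, 111488-bit filter, 54062 bits set
print(filter_stats(7432, 111488, 54062))     # ~(15.0, 0.48)
# Working CF: 47432 rows, 771904-bit filter, 354610 bits set
print(filter_stats(47432, 771904, 354610))   # ~(16.3, 0.46)
```

Both filters sit near the expected half-full mark, which is consistent with Daniel's finding that the filters themselves were fine.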
Re: High BloomFilterFalseRatio
On Tue, Nov 2, 2010 at 1:28 AM, Daniel Doubleday daniel.double...@gmx.net wrote:
[...] Turns out that the JMX BloomFilterFalseRatio simply does not show what I expected it to. [...] it is (False Positives / (False + True Positives)), which means it does not count the lookups that were rejected by the filter. So if you only ever ask for rows that do not exist, this ratio will always show 1.0. [...]

That sounds like something we should change.

-ryan
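To make the difference between the two definitions concrete, here is a toy calculation; all the counts are made up, only the denominators matter:

```python
# The metric Daniel expected vs. what the JMX attribute reports.
# Counts below are hypothetical illustration values.

def expected_ratio(false_pos, total_lookups):
    # Unnecessary file lookups / total lookups
    return false_pos / total_lookups

def jmx_ratio(false_pos, true_pos):
    # False positives / (false + true positives): lookups the filter
    # correctly rejects (true negatives) never enter the denominator.
    return false_pos / (false_pos + true_pos)

# A workload that only asks for rows that do not exist:
true_neg, false_pos, true_pos = 9990, 10, 0
print(expected_ratio(false_pos, true_neg + false_pos + true_pos))  # 0.001
print(jmx_ratio(false_pos, true_pos))                              # 1.0
```

The filter is rejecting 99.9% of lookups correctly, yet the reported ratio is 1.0 because every counted event happens to be a false positive.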
High BloomFilterFalseRatio
Hi people

We are currently moving our second use case from mysql to cassandra. While importing the data (ongoing) I noticed that the BloomFilterFalseRatio seems to be pretty high compared to another CF which is in use in production right now.

It's a hierarchical data model and I cannot avoid doing a read before inserting multiple columns. I see a false positive ratio of 0.28 while in my other CF it is 0.00025. The CF has 5 live sstables while I read that ratio. At that time I had inserted ~200k rows with a total of 1M cols. Row keys are pretty large unfortunately (key.length() ~ 60).

Just wanted to check if this value is to be expected.

Thanks, Daniel
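For scale: with roughly 15 bloom-filter bits per key (the figure measured later in this thread) and a near-optimal number of hash functions, textbook theory puts the per-filter false positive rate well under 0.001, so an observed 0.28 is orders of magnitude off. A sketch using the standard approximation, with k chosen as the textbook optimum rather than whatever Cassandra 0.6 actually uses:

```python
import math

def bloom_fp_rate(bits_per_key, k=None):
    """Textbook false positive rate (1 - e^(-k*n/m))^k for a bloom
    filter with m/n = bits_per_key bits per key and k hash functions."""
    if k is None:
        # Optimal hash count is approximately ln(2) * m/n.
        k = max(1, round(math.log(2) * bits_per_key))
    return (1 - math.exp(-k / bits_per_key)) ** k

print(bloom_fp_rate(15))   # roughly 7e-4, nowhere near the observed 0.28
```

The healthy CF's 0.00025 is in the same ballpark as the theoretical value; the 0.28 is not, which is the first hint that the metric measures something other than per-lookup filter quality.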
Re: High BloomFilterFalseRatio
Hm - not sure if I understand the random question. We are using RP, but I wouldn't know why that should matter. I thought that the bloom filter hash function should distribute evenly no matter what keys come in. Keys are '/'-separated strings (aka paths :-))

I do bulk inserts like this (1000 rows at a time, with ~50 cols each):

[
 {'a/b/foo': cols},
 {'a/b/bar': cols},
 {'a/b/baz': cols}
]

and before that I would query for 'a/b'. Recursively, as in mkdir -p: if parent paths are missing they are inserted with the bulk insert.

The value for BloomFilterFalseRatio has been in the range of 0.19 - 0.59 in the last couple of hours, mostly around 0.3. We're on 0.6.6 btw.

On Oct 27, 2010, at 3:58 PM, Jonathan Ellis wrote:

This is not expected, no. How random are your queries? If you have a couple outlier rows causing the false positives that are being queried over and over then that could just be the luck of the draw.

On Wed, Oct 27, 2010 at 5:24 AM, Daniel Doubleday daniel.double...@gmx.net wrote:
[...] I see a false positive ratio of 0.28 while in my other CF it is 0.00025. [...]

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
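The read-before-insert pattern described above could be sketched like this; `exists` and `insert_row` are hypothetical stand-ins for the actual client calls, not Cassandra API:

```python
# mkdir -p style parent creation: before inserting rows under a path,
# walk up the hierarchy and create any missing ancestors. Each exists()
# call is a read that goes through the sstable bloom filters, and during
# import it usually targets a key that does not exist yet.

def ensure_parents(path, exists, insert_row):
    """Return the list of parent paths that had to be created."""
    created = []
    parts = path.split('/')
    for i in range(1, len(parts)):
        parent = '/'.join(parts[:i])       # 'a', then 'a/b', ...
        if not exists(parent):             # the read-before-write
            insert_row(parent)
            created.append(parent)
    return created

# Tiny in-memory stand-in for the store:
store = {'a'}
created = ensure_parents('a/b/foo', store.__contains__, store.add)
print(created)  # ['a/b']
```

During a bulk import most of these existence checks are for keys that are genuinely absent, which is exactly the query skew discussed later in the thread.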
Re: High BloomFilterFalseRatio
Do you have a key a/b then? What columns does it have?

On Wed, Oct 27, 2010 at 9:14 AM, Daniel Doubleday daniel.double...@gmx.net wrote:
Hm - not sure if I understand the random question. We are using RP, but I wouldn't know why that should matter. I thought that the bloom filter hash function should distribute evenly no matter what keys come in. Keys are '/'-separated strings (aka paths :-)) [...]

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: High BloomFilterFalseRatio
Ah of course - the question makes total sense. But no, this is not the case: I am not constantly asking the same question, since the tree is deep enough. Most data nodes are at level 5 from the root, so the parents getting queried will be different most of the time. Since the parent nodes are created, the queries stop there and don't propagate toward the root. And I am seeing the high values all the time; the best it gets is 0.15.

Daniel

On 27.10.10 18:37, Mike Malone wrote:

I think he was asking about queries, not data. The data may be randomly distributed by way of a hash on the key, but if your queries are heavily skewed (e.g., if you query for foo a lot more than foo/bar, and foo randomly happens to trigger a false positive) the skew in your query pattern could cause a seemingly strange spike in false positives. With a hierarchical data model it's not unlikely that this sort of skew exists, since you'd tend to query for items towards the root of the hierarchy more frequently.

Mike

On Wed, Oct 27, 2010 at 2:14 PM, Daniel Doubleday daniel.double...@gmx.net wrote:
[...] The value for BloomFilterFalseRatio has been in the range of 0.19 - 0.59 in the last couple of hours. Mostly around 0.3. We're on 0.6.6 btw. [...]
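Mike's point, combined with the metric's actual definition of False Positives / (False + True Positives) from the resolution at the top of this thread, accounts for the numbers. A toy simulation (all rates are made up) shows how a workload dominated by lookups for absent keys inflates the reported ratio far above the per-lookup false positive rate:

```python
# Toy simulation: a healthy filter with a tiny per-lookup false
# positive rate, queried mostly for keys that do not exist. Correctly
# rejected lookups (true negatives) never enter the reported ratio's
# denominator, so the ratio lands far above the per-lookup FP rate.
import random

random.seed(1)
FP_RATE = 0.0005      # assumed per-lookup false positive probability
P_EXISTS = 0.001      # assumed fraction of queries for existing keys

false_pos = true_pos = 0
for _ in range(200_000):
    if random.random() < P_EXISTS:
        true_pos += 1                 # real key: filter passes it
    elif random.random() < FP_RATE:
        false_pos += 1                # absent key: filter mis-fires
    # absent key correctly rejected: not counted by the metric at all

reported = false_pos / (false_pos + true_pos)
print(reported)   # typically around 0.3, despite FP_RATE of 0.0005
```

With these made-up rates the reported value lands in the 0.19 - 0.59 band Daniel observed, even though the filter mis-fires on only 0.05% of lookups.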