Re: High BloomFilterFalseRation

2010-11-02 Thread Daniel Doubleday
Hi all

I had some time yesterday to dig a little deeper. Maybe this saves some time 
for someone who made the same mistake, so ...

I tried to reproduce the problem in unit tests with the same data, which led 
nowhere because every single result was almost exactly what the math 
promised. Along the way I stumbled upon 
http://sites.google.com/site/murmurhash/murmurhash2flaw and thought omg all is 
lost ... but I finally found that everything is just fine.

It turns out that the JMX BloomFilterFalseRatio simply does not show what I 
expected. I thought it would be a quality measure of how well the bloom 
filter works, which would be (Unnecessary File Lookups / Total Lookups). 
Instead it is (False Positives / (False Positives + True Positives)), which 
means it does not count the lookups that were rejected by the filter.

So if you only ask for rows that do not exist, this ratio will always 
show 1.0 (once at least one false positive has occurred).

Meaning it is rather a measure of how many of your queries ask for 
non-existing values.
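
To make the difference concrete, here is a minimal sketch of the two ratios described above. The function names and the workload numbers are made up for illustration; this is not Cassandra's code, just the arithmetic.

```python
def reported_ratio(false_pos, true_pos):
    """Ratio as described above: FP / (FP + TP).

    Lookups rejected by the filter (true negatives) are not counted at all.
    """
    positives = false_pos + true_pos
    return false_pos / positives if positives else 0.0


def expected_ratio(false_pos, true_pos, true_neg):
    """What I expected: unnecessary file lookups / total lookups."""
    total = false_pos + true_pos + true_neg
    return false_pos / total if total else 0.0


# Hypothetical workload: 10,000 reads for missing rows, of which 10 slip
# through the filter, plus 30 reads for rows that really exist.
false_pos, true_neg, true_pos = 10, 9990, 30

print("reported:", reported_ratio(false_pos, true_pos))
print("expected:", expected_ratio(false_pos, true_pos, true_neg))

# Only asking for rows that do not exist: every positive is a false positive.
print("missing rows only:", reported_ratio(false_pos, 0))
```

With numbers like these the filter rejects almost every unnecessary lookup, yet the reported ratio is high simply because few of the queries ask for existing rows.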

Cheers,
Daniel
 

On Oct 28, 2010, at 1:10 PM, Daniel Doubleday wrote:

 Hi Ryan
 
 I took a sample of one sstable (just flushed, not compacted). 
 
 I compared 2 sstable samples: one from the CF that shows fine false 
 positive ratios and one from the problem CF. 
 And yes, both look the same to me. Both have the expected 15 buckets per row, 
 and the cardinalities of the bitsets are about the same relative to their size.
 
 But I am pretty sure that it is indeed, as suggested, a problem with a skewed 
 query pattern. I stopped the import and started a random read test, and things 
 look better.
 
 I'll try to reproduce this with a patched Cassandra to get more debug info and 
 figure out why this is happening, because I still don't understand it.
 
 Thanks for your time everyone
 
 == Sample of problem CF ==
 
 DATA FILE
 
 file size: 68804626 bytes
 rows: 7432 
 
 FILTER FILE
 
 file size: 14013 bytes
 bloom filter bitset size: 111488
 bloom filter bitset cardinality: 54062
 
 
 == Sample of working CF ==
 
 DATA FILE
 
 file size: 110730565 bytes
 rows: 47432
 
 FILTER FILE
 
 file size: 96565 bytes
 bloom filter bitset size: 771904
 bloom filter bitset cardinality: 354610
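
 A quick way to compare the two filters is to look at the fraction of bits set. The sketch below uses the sizes and cardinalities from the samples above; the number of hash functions is an assumption (it is not stated in this thread), and fill ** k is only the usual textbook estimate of the false positive rate.

```python
def fill_fraction(cardinality, bitset_size):
    """Fraction of bits that are set in the filter."""
    return cardinality / bitset_size


def estimated_false_positive_rate(cardinality, bitset_size, num_hashes):
    """Textbook estimate: an absent key passes only if all k probed bits are set."""
    return fill_fraction(cardinality, bitset_size) ** num_hashes


# ASSUMPTION: the number of hash functions is not given in the thread.
ASSUMED_NUM_HASHES = 8

# (cardinality, bitset size) taken from the two samples above.
samples = {
    "problem CF": (54062, 111488),
    "working CF": (354610, 771904),
}

for name, (cardinality, size) in samples.items():
    fill = fill_fraction(cardinality, size)
    rate = estimated_false_positive_rate(cardinality, size, ASSUMED_NUM_HASHES)
    print(f"{name}: fill={fill:.3f} estimated_fp={rate:.5f}")
```

 Both filters are a bit less than half full, so by this estimate neither should produce false positives anywhere near the 0.28 reported for the problem CF.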
 
 
 On Oct 27, 2010, at 6:41 PM, Ryan King wrote:
 
 On Wed, Oct 27, 2010 at 3:24 AM, Daniel Doubleday
 daniel.double...@gmx.net wrote:
 Hi people
 
 We are currently moving our second use case from MySQL to Cassandra. While 
 importing the data (ongoing) I noticed that the BloomFilterFalseRatio 
 seems to be pretty high compared to another CF which is in use in 
 production right now.
 
 It's a hierarchical data model and I cannot avoid doing a read before 
 inserting multiple columns.
 
 I see a false positive ratio of 0.28 while in my other CF it is 0.00025.
 
 The CF had 5 live sstables while I read that ratio. At that time I had 
 inserted ~ 200k rows with a total of 1M cols. Row keys are pretty large 
 unfortunately (key.length() ~ 60)
 
 Just wanted to check if this value is to be expected.
 
 This is not expected. How big are the bloom filters on disk?
 
 -ryan
 



Re: High BloomFilterFalseRation

2010-11-02 Thread Ryan King
On Tue, Nov 2, 2010 at 1:28 AM, Daniel Doubleday
daniel.double...@gmx.net wrote:
 It turns out that the JMX BloomFilterFalseRatio simply does not show what I 
 expected. I thought it would be a quality measure of how well the bloom 
 filter works, which would be (Unnecessary File Lookups / Total Lookups). 
 Instead it is (False Positives / (False Positives + True Positives)), which 
 means it does not count the lookups that were rejected by the filter.

 So if you only ask for rows that do not exist, this ratio will always 
 show 1.0

 Meaning it is rather a measure of how many of your queries ask for 
 non-existing values.

That sounds like something we should change.

-ryan


High BloomFilterFalseRation

2010-10-27 Thread Daniel Doubleday
Hi people

We are currently moving our second use case from MySQL to Cassandra. While 
importing the data (ongoing) I noticed that the BloomFilterFalseRatio seems to 
be pretty high compared to another CF which is in use in production right now.

It's a hierarchical data model and I cannot avoid doing a read before inserting 
multiple columns.

I see a false positive ratio of 0.28 while in my other CF it is 0.00025.

The CF had 5 live sstables while I read that ratio. At that time I had inserted 
~ 200k rows with a total of 1M cols. Row keys are pretty large unfortunately 
(key.length() ~ 60)

Just wanted to check if this value is to be expected. 



Thanks,
Daniel

Re: High BloomFilterFalseRation

2010-10-27 Thread Daniel Doubleday
Hm -

I'm not sure I understand the random question. We are using RP 
(RandomPartitioner), but I wouldn't know why that should matter.
I thought that the bloom filter hash function should distribute evenly no 
matter what keys come in.
 
Keys are '/' separated strings (aka paths :-))

I do bulk inserts like this (1000 rows at a time, with ~ 50 cols each):

[
{'a/b/foo': cols},
{'a/b/bar': cols},
{'a/b/baz': cols}
]

and before that I query for 'a/b', recursively as in mkdir -p.

If parent paths are missing, they are inserted with the bulk insert.
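
Roughly, the read-before-insert step looks like the sketch below. The `exists` callable is a hypothetical stand-in for a read against the CF, and `missing_ancestors` is a name made up for this example; our real code differs.

```python
def missing_ancestors(path, exists):
    """Return the ancestor paths of `path` that still have to be created.

    Walks from the immediate parent toward the root and stops at the first
    ancestor that already exists, like `mkdir -p`. Each call to `exists`
    stands for one read, and most of those reads are for keys that are
    expected to be missing during an import.
    """
    missing = []
    parts = path.split("/")
    for depth in range(len(parts) - 1, 0, -1):
        parent = "/".join(parts[:depth])
        if exists(parent):
            break  # everything above this parent exists already
        missing.append(parent)
    return missing


# Hypothetical state: only the root 'a' has been stored so far.
stored = {"a"}
print(missing_ancestors("a/b/foo", stored.__contains__))
```

The returned parents would then go into the same bulk insert as the rows themselves.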

The value for BloomFilterFalseRatio has been in the range of 0.19 - 0.59 over 
the last couple of hours, mostly around 0.3.

We're on 0.6.6 btw


On Oct 27, 2010, at 3:58 PM, Jonathan Ellis wrote:

 This is not expected, no.  How random are your queries?  If you have a
 couple outlier rows causing the false positives that are being queried
 over and over then that could just be the luck of the draw.
 
 -- 
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: High BloomFilterFalseRation

2010-10-27 Thread Jonathan Ellis
Do you have a key a/b then?  What columns does it have?


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: High BloomFilterFalseRation

2010-10-27 Thread Daniel Doubleday

Ah of course - question makes total sense.

But no, this is not the case: I am not constantly asking the same 
question, since the tree is deep enough. Most data nodes are at level 5 from 
the root, so the parents getting queried will be different most of the time.


Once the parent nodes are created, the queries stop there and don't 
propagate toward the root.


And I am seeing the high values all the time. The best it gets is 0.15.

Daniel

On 27.10.10 18:37, Mike Malone wrote:
I think he was asking about queries, not data. The data may be 
randomly distributed by way of a hash on the key, but if your queries 
are heavily skewed (e.g., if you query for foo a lot more than 
foo/bar, and foo randomly happens to trigger a false positive) the 
skew in your query pattern could cause a seemingly strange spike in 
false positives.


With a hierarchical data model it's not unlikely that this sort of 
skew exists since you'd tend to query for items towards the root of 
the hierarchy more frequently.
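
As a rough illustration of that effect, the sketch below weights each absent key by how often it is queried. The keys, counts, and the function name are hypothetical; it only shows the arithmetic, not how Cassandra tracks its statistics.

```python
def observed_false_positive_rate(query_counts, false_positive_keys):
    """Share of absent-key queries that pass the filter, weighted by query count."""
    total = sum(query_counts.values())
    passed = sum(
        count for key, count in query_counts.items() if key in false_positive_keys
    )
    return passed / total if total else 0.0


# Hypothetical: 1000 absent keys, exactly one of which ("foo") happens to
# be a false positive in the filter.
absent_keys = ["foo"] + [f"foo/{i}" for i in range(999)]
false_positive_keys = {"foo"}

# Every absent key queried once.
uniform = {key: 1 for key in absent_keys}

# Same keys, but "foo" is queried 500 times.
skewed = dict(uniform, foo=500)

print("uniform:", observed_false_positive_rate(uniform, false_positive_keys))
print("skewed:", observed_false_positive_rate(skewed, false_positive_keys))
```

The filter is identical in both cases; only the query mix changes, and the observed rate moves from about one in a thousand to roughly a third.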


Mike
