Hi,
I did an analysis of some bloom_filter_fp_chance values to see what size
of bloom filter is created for different settings. For a 1 million row
dataset I got the following data:
1). For bloom_filter_fp_chance = .01 (default): 1.20 MB (1,259,720 bytes)
2). For bloom_filter_fp_chance = .001:  1.80 MB (1,889,576 bytes)
3). For bloom_filter_fp_chance = .0001: 2.40 MB (2,519,424 bytes)
NOTE: the size of the bloom filter may also vary with the size of the
partition key.
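
For reference, these sizes line up with the standard bloom filter sizing
formula, bits/key = -ln(p) / (ln 2)^2, rounded up to whole buckets per key.
A minimal sketch of that calculation (my own code, not Cassandra's):

    public class BloomSizeEstimate {
        public static void main(String[] args) {
            long keys = 1_000_000L;
            for (double p : new double[] {0.01, 0.001, 0.0001}) {
                // bits per key = -ln(p) / (ln 2)^2, rounded up to whole buckets
                int bitsPerKey = (int) Math.ceil(-Math.log(p) / (Math.log(2) * Math.log(2)));
                // 10, 15 and 20 bits/key -> ~1.19, ~1.79 and ~2.38 MB for 1M keys
                System.out.printf("p=%s -> %d bits/key, ~%.2f MB%n",
                        p, bitsPerKey, keys * bitsPerKey / 8.0 / (1024 * 1024));
            }
        }
    }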
From bloom_filter_fp_chance = .00001 onwards, it starts throwing the
following exception at flush time (nodetool flush):

java.lang.UnsupportedOperationException: Unable to satisfy 1.0E-5 with 20 buckets per element
        at org.apache.cassandra.utils.BloomCalculations.computeBloomSpec(BloomCalculations.java:150)
        .....
In this scenario I was able to query data from cqlsh, but the problem was
that C* was not able to flush the data: it just created two files
(*-Data.db and *-Index.db) of size 0 KB.
I checked the BloomCalculations class; the following lines are responsible
for this exception:
    if (maxFalsePosProb < probs[maxBucketsPerElement][maxK]) {
        throw new UnsupportedOperationException(String.format("Unable to satisfy %s with %s buckets per element",
                maxFalsePosProb, maxBucketsPerElement));
    }
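
So the limit follows directly from the math: with the cap of 20 buckets per
element and the optimal number of hash functions (k = round(ln 2 * 20) = 14),
the best achievable false-positive rate is roughly 6.7E-5, and anything below
that throws. A rough sketch of that calculation (mine, not Cassandra's code):

    public class MinSatisfiableFp {
        public static void main(String[] args) {
            int maxBucketsPerElement = 20; // hard cap in BloomCalculations
            // optimal hash count for m/n bits per key: k = ln(2) * (m/n)
            int k = (int) Math.round(Math.log(2) * maxBucketsPerElement); // 14
            // classic false-positive approximation: (1 - e^(-k*n/m))^k
            double minFp = Math.pow(1 - Math.exp(-k / (double) maxBucketsPerElement), k);
            // prints ~6.71e-05, so 1.0E-5 can never be satisfied
            System.out.printf("min satisfiable fp chance = %.2e%n", minFp);
        }
    }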
I have the following questions:

1). Is there any other way to configure the number of buckets along with
bloom_filter_fp_chance, to avoid this exception?
2). If this validation is hard-coded, why is it even allowed to set a
bloom_filter_fp_chance value that prevents SSTable generation?
Thanks,
Adarsh
On Fri, May 20, 2016 at 3:18 AM, Oleksandr Petrov <
[email protected]> wrote:
> Bloom filters are used to avoid disk seeks on accessing sstables. As we
> don't know where exactly the partition resides, we have to narrow down the
> search to the particular sstables where the data most probably is.
>
> Given that you most likely won't store 50B rows on a single node, you
> will probably have a larger cluster.
>
> I would start with writing data (possibly test payloads) rather than
> tweaking params. More often than not such optimizations may have adverse
> effects.
>
> Better data modeling is extremely important though: picking the right
> partition key and having a good clustering key layout will help you run
> queries in the most performant way.
>
> With regard to compaction strategy you may want to read something like
> https://www.instaclustr.com/blog/2016/01/27/apache-cassandra-compaction/
> where several compaction strategies are compared and explained in detail.
>
> Start with defaults, tweak the params that make sense for your read/write
> workloads under realistic stress-test payloads, measure the results and
> make decisions. If there were a particular setting that suddenly made
> Cassandra much better and faster, it would already be on by default.
> On Thu, 19 May 2016 at 16:55, Kai Wang <[email protected]> wrote:
>
>> With 50 billion rows and bloom_filter_fp_chance = 0.01, the bloom filter
>> will consume a lot of off-heap memory. You may want to take that into
>> consideration too.
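>>
>> As a rough back-of-the-envelope sketch (my own numbers, assuming the
>> ~10 bits per key that fp_chance = 0.01 works out to):
>>
>>     long keys = 50_000_000_000L;          // 50 billion rows
>>     int bitsPerKey = 10;                  // ~bloom_filter_fp_chance 0.01
>>     double gib = keys * (double) bitsPerKey / 8 / (1L << 30);
>>     // ~58 GiB of bloom filter held off heap, spread across the cluster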
>>
>> On Wed, May 18, 2016 at 11:53 PM, Adarsh Kumar <[email protected]>
>> wrote:
>>
>>> Hi Sai,
>>>
>>> We have a use case where we are designing a table that is going to have
>>> around 50 billion rows, and we require very fast reads. Partitions are
>>> not that complex/big; each holds some validation data for duplicate
>>> checks (consisting of 4-5 int and varchar columns). So we were trying
>>> various options to optimize read performance. Apart from tuning the
>>> bloom filter we are trying the following things:
>>>
>>> 1). Better data modelling (making appropriate partition and clustering
>>> keys)
>>> 2). Trying Leveled compaction (changing data model for this one)
>>>
>>> Jonathan,
>>>
>>> I understand that tuning bloom_filter_fp_chance will not bring a drastic
>>> performance gain, but it is one of the many things we are trying.
>>> Please let me know if you have any other suggestions to improve read
>>> performance for this volume of data.
>>>
>>> Also please let me know about any performance benchmarking techniques
>>> (currently we are planning to trigger massive reads from Spark and check
>>> cfstats).
>>>
>>> NOTE: we will be deploying DSE on EC2, so please suggest if you have
>>> anything specific to DSE and EC2.
>>>
>>> Adarsh
>>>
>>> On Wed, May 18, 2016 at 9:45 PM, Jonathan Haddad <[email protected]>
>>> wrote:
>>>
>>>> The impact is that the bloom filter will get massively bigger, with very
>>>> little performance benefit, if any.
>>>>
>>>> You can't get 0 because it's a probabilistic data structure. It tells
>>>> you either:
>>>>
>>>> your data is definitely not here
>>>> your data has a pretty decent chance of being here
>>>>
>>>> but never "it's here for sure"
>>>>
>>>> https://en.wikipedia.org/wiki/Bloom_filter
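>>>>
>>>> As a minimal illustration of that behaviour (using Guava's BloomFilter
>>>> here, not Cassandra's implementation):
>>>>
>>>>     import com.google.common.hash.BloomFilter;
>>>>     import com.google.common.hash.Funnels;
>>>>     import java.nio.charset.StandardCharsets;
>>>>
>>>>     BloomFilter<String> f = BloomFilter.create(
>>>>             Funnels.stringFunnel(StandardCharsets.UTF_8),
>>>>             1_000_000, 0.01);  // 1M keys, 1% fp chance (C*'s default)
>>>>     f.put("partition-key-42");
>>>>     f.mightContain("partition-key-42"); // always true: no false negatives
>>>>     f.mightContain("never-inserted");   // usually false, ~1% chance of true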
>>>>
>>>> On Wed, May 18, 2016 at 11:04 AM sai krishnam raju potturi <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Adarsh,
>>>>> were there any drawbacks to setting bloom_filter_fp_chance to the
>>>>> default value?
>>>>>
>>>>> thanks
>>>>> Sai
>>>>>
>>>>> On Wed, May 18, 2016 at 2:21 AM, Adarsh Kumar <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the impact of setting bloom_filter_fp_chance < 0.01?
>>>>>>
>>>>>> During performance tuning I was trying to tune bloom_filter_fp_chance
>>>>>> and have the following questions:
>>>>>>
>>>>>> 1). Why is bloom_filter_fp_chance = 0 not allowed? (
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-5013)
>>>>>> 2). What is the maximum/recommended value of bloom_filter_fp_chance
>>>>>> (if we do not have any limitation on bloom filter size)?
>>>>>>
>>>>>> NOTE: We are using the default SizeTieredCompactionStrategy on
>>>>>> Cassandra 2.1.8.621.
>>>>>>
>>>>>> Thanks in advance..:)
>>>>>>
>>>>>> Adarsh Kumar
>>>>>>
>>>>>
>>>>>
>>>
>> --
> Alex Petrov
>