Hi Shalom,
Thanks for your notes! So you have experienced this too... fine.
Then maybe the best rules to follow are these:
a) never(!) run a query with "ALLOW FILTERING" on a Production cluster
b) if you need these queries, build a test cluster (somehow) and mirror
the data (somehow) OR add denormalized tables (write + code complexity
overhead) to serve those queries
Can we agree on this one maybe as a "good to follow" policy?
In our case, luckily, users = developers, always. So I can expect them
to be aware of the consequences of a particular query.
We also have test data fully mirrored into a test cluster, so running
those queries on the test system is possible.
Plus, if for whatever reason we really need to run such a query in
Prod, I can simply instruct them to test the query in the test system
first.
cheers
Attila Wind
http://www.linkedin.com/in/attilaw
Mobile: +36 31 7811355
On 2019. 05. 28. 8:59, shalom sagges wrote:
Hi Attila,
I'm definitely no guru, but I've experienced several cases where
people at my company used allow filtering and caused major performance
issues.
As data size increases, the impact gets stronger. If you have large
partitions, performance will decrease.
GC can be affected too. And if GC stop-the-world pauses are too long
and too frequent, you will feel it.
I sincerely believe the best way would be to educate the users and
remodel the data. Perhaps you need to denormalize your tables or at
least use secondary indices (I prefer to keep it as simple as possible
and denormalize).
If it's a cluster for analytics, perhaps you need to build a
dedicated cluster only for that, so that if something does break or
comes under too much load, normal activities won't be affected. But
there are pros and cons to that idea too.
Hope this helps.
Regards,
On Tue, May 28, 2019 at 9:43 AM Attila Wind <[email protected]>
wrote:
Hi Gurus,
It looks like this thread has stalled. However, I would be very
curious to hear answers regarding b) ...
Anyone any comments on that?
I do see this as a potential production outage risk now...
Especially as we are planning to run analysis queries by hand
exactly like that over the cluster...
thanks!
Attila Wind
On 2019. 05. 23. 11:42, shalom sagges wrote:
> a) Interesting... But only in case you do not provide the
> partitioning key, right? (So IN() is for the partitioning key?)
I think you should ask yourself a different question. Why am I
using ALLOW FILTERING in the first place? What happens if I
remove it from the query?
I prefer to denormalize the data to multiple tables or at least
create an index on the requested column (preferably queried
together with a known partition key).
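To illustrate the denormalization idea, here is a minimal sketch in plain Python rather than against a real cluster. The "tables" are in-memory dicts, and the names orders_by_id, orders_by_customer and insert_order are hypothetical, not taken from anyone's schema. It only shows the trade-off: every write goes to two places, and in exchange the read by customer becomes a single-key lookup instead of a scan.

```python
from collections import defaultdict

# Hypothetical "tables": each dict maps a partition key to its row(s).
orders_by_id = {}                        # base table, keyed by order_id
orders_by_customer = defaultdict(list)   # denormalized copy, keyed by customer


def insert_order(order_id, customer, total):
    """Write the same row to both tables (the extra write is the cost)."""
    row = {"order_id": order_id, "customer": customer, "total": total}
    orders_by_id[order_id] = row
    orders_by_customer[customer].append(row)


def find_by_customer_scan(customer):
    """Roughly what filtering amounts to: inspect every row in the base table."""
    return [r for r in orders_by_id.values() if r["customer"] == customer]


def find_by_customer_lookup(customer):
    """Single-key read against the denormalized table, no scan needed."""
    return list(orders_by_customer.get(customer, []))
```

Both functions return the same rows; the difference is that the scan cost grows with the whole table, while the lookup cost grows only with the rows of that one customer.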
> b) Still does not explain or justify the "all 8 nodes to halt and
> unresponsiveness to external requests" behavior... Even if
> servers are busy with the request, seriously becoming
> non-responsive...?
I think it can justify the unresponsiveness. When using ALLOW
FILTERING, you are doing something like a full table scan in a
relational database.
There is a lot of information on the internet regarding this
subject such as
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
Hope this helps.
Regards,
On Thu, May 23, 2019 at 7:33 AM Attila Wind
<[email protected]> wrote:
Hi,
"When you run a query with allow filtering, Cassandra doesn't
know where the data is located, so it has to go node by node,
searching for the requested data."
a) Interesting... But only in case you do not provide the
partitioning key, right? (So IN() is for the partitioning key?)
b) Still does not explain or justify the "all 8 nodes to halt and
unresponsiveness to external requests" behavior... Even if
servers are busy with the request, seriously becoming
non-responsive...?
cheers
Attila Wind
On 2019. 05. 23. 0:37, shalom sagges wrote:
Hi Vsevolod,
> 1) Why such behavior? I thought any given SELECT request is
> handled by a limited subset of C* nodes and not by all of
> them, as per connection consistency/table replication
> settings, in case.
When you run a query with allow filtering, Cassandra doesn't
know where the data is located, so it has to go node by
node, searching for the requested data.
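A toy model may make the difference clearer. This is only a sketch under simplifying assumptions: the placement function below (hash the key, take the next nodes in order) is invented for illustration and is not how Cassandra's partitioner or vnodes actually work. The 8 nodes and replication factor 3 are the numbers from the original question.

```python
import hashlib

NODES = ["node%d" % i for i in range(8)]  # 8-node cluster, as in the question
RF = 3                                    # replication factor from the question


def replicas(partition_key):
    """Toy placement: hash the key to a start node, then take the next RF - 1 nodes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    start = int(digest, 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(RF)]


def nodes_for_key_lookup(partition_key):
    """With a partition key, only the replicas owning that key are candidates."""
    return set(replicas(partition_key))


def nodes_for_filter_scan():
    """Without a partition key, every token range, and so every node, is scanned."""
    return set(NODES)
```

In this model a lookup by key touches at most 3 of the 8 nodes, while the scan touches all 8, which is consistent with disk IO rising across the whole cluster.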
> 2) Is it possible to forbid ALLOW FILTERING flag for given
> users/groups?
I'm not familiar with such a flag. In my case, I just try to
educate the R&D teams.
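Short of a server-side setting, one option is a check in the application layer before a query is sent. The sketch below is an assumption about how such a guard could look, not an existing Cassandra or driver feature; check_query, FilteringNotAllowed and the "prod" environment label are all hypothetical names. The regex is deliberately naive and would also match the words inside a string literal.

```python
import re

# Matches "ALLOW FILTERING" in any letter case, with any whitespace in between.
_ALLOW_FILTERING = re.compile(r"\bALLOW\s+FILTERING\b", re.IGNORECASE)


class FilteringNotAllowed(Exception):
    """Raised when a filtering query is about to be sent to production."""


def check_query(cql, environment):
    """Return the query unchanged, or raise if it uses ALLOW FILTERING in prod."""
    if environment == "prod" and _ALLOW_FILTERING.search(cql):
        raise FilteringNotAllowed("ALLOW FILTERING is not permitted in prod: " + cql)
    return cql
```

A wrapper like this only helps if all queries go through it; anyone with direct cqlsh access can still bypass it, so it complements education rather than replacing it.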
Regards,
On Wed, May 22, 2019 at 5:01 PM Vsevolod Filaretov
<[email protected]> wrote:
Hello everyone,
We have an 8-node C* cluster with a large volume of
unbalanced data. Usual per-partition selects work
somewhat fine, and are processed by a limited number of
nodes, but if a user issues SELECT WHERE IN () ALLOW
FILTERING, such a command stalls all 8 nodes to halt and
unresponsiveness to external requests while disk IO
jumps to 100% across the whole cluster. After several minutes
all nodes seem to finish processing the request and the
cluster goes back to being responsive. The replication
factor for all data is 3.
1) Why such behavior? I thought any given SELECT request
is handled by a limited subset of C* nodes and not by
all of them, as per connection consistency/table
replication settings, in case.
2) Is it possible to forbid ALLOW FILTERING flag for
given users/groups?
Thank you all very much in advance,
Vsevolod Filaretov.