Thank you all for your insights.

When the Spark connector adds ALLOW FILTERING to a query, that query simply runs, whether it is cheap against a table with few rows or very expensive against a large one.
In my particular case, nodes are carrying roughly 2 TB of data each in a 50-node cluster, and when a bunch of such queries run at once they put noticeable pressure on server resources.

Since ALLOW FILTERING is an expensive operation, I’m trying to find knobs I can turn to mitigate the impact.

What I think, and please correct me if I am wrong, is that the query design itself is not optimized for the table design, and that mismatch is what causes the connector to add ALLOW FILTERING implicitly. I’m not considering secondary indexes on the tables because they have their own overhead. Kindly share any other means we could use to keep the connector from falling back on ALLOW FILTERING.
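
To make the question concrete, here is roughly the query shape I mean (a minimal sketch only; the keyspace, table, column names, and contact point are placeholders rather than our actual schema, and exact pushdown behaviour depends on the connector version and settings):

    import org.apache.spark.sql.SparkSession

    // Placeholder session; spark.cassandra.connection.host points at the cluster.
    val spark = SparkSession.builder()
      .appName("allow-filtering-sketch")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Hypothetical table ks.events: partition key sensor_id, regular column status.
    val events = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "ks", "table" -> "events"))
      .load()

    // Restricting on the partition key can be pushed down as a selective
    // per-partition read, so no broad scan is needed.
    val byPartitionKey = events.filter("sensor_id = 'sensor-42'")

    // Restricting only on a regular, non-indexed column cannot be answered
    // without scanning; this is the kind of predicate that ends up either
    // filtered inside Spark or pushed down with ALLOW FILTERING.
    val byRegularColumn = events.filter("status = 'ERROR'")

The second shape is the kind of predicate I have in mind; I’d like to know whether anything short of redesigning the queries or tables can change how the connector handles it.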

Thanks again.
Asad



From: Jeff Jirsa [mailto:jji...@gmail.com]
Sent: Thursday, July 25, 2019 10:24 AM
To: cassandra <user@cassandra.apache.org>
Subject: Re: Performance impact with ALLOW FILTERING clause.

"unpredictable" is such a loaded word. It's quite predictable, but it's often 
mispredicted by users.

"ALLOW FILTERING" basically tells the database you're going to do a query that 
will require scanning a bunch of data to return some subset of it, and you're 
not able to provide a WHERE clause that's sufficiently fine grained to avoid 
the scan. It's a loose equivalent of doing a full table scan in SQL databases - 
sometimes it's a valid use case, but it's expensive, you're ignoring all of the 
indexes, and you're going to do a lot more work.

It's predictable, though - you're probably going to walk over some range of 
data. Spark is grabbing all of the data to load into RDDs, and it probably does 
it by slicing up the range, doing a bunch of range scans.

It's doing that so it can get ALL of the data and do the filtering / joining / 
searching in-memory in spark, rather than relying on cassandra to do the 
scanning/searching on disk.
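
Conceptually it looks something like this (a simplified sketch of the idea, not the connector's actual code; the keyspace, table, and column names are made up):

    // Split the Murmur3 token ring into N ranges and build one
    // range-restricted query per range.
    case class TokenRange(start: Long, end: Long)

    // Murmur3Partitioner tokens span Long.MinValue .. Long.MaxValue.
    def splitRing(splits: Int): Seq[TokenRange] = {
      val ringSize = BigInt(Long.MaxValue) - BigInt(Long.MinValue)
      val step = ringSize / splits
      (0 until splits).map { i =>
        val start = BigInt(Long.MinValue) + step * i
        val end = if (i == splits - 1) BigInt(Long.MaxValue) else start + step
        TokenRange(start.toLong, end.toLong)
      }
    }

    // One query per range; carrying an extra non-key predicate along with the
    // token-range restriction is what requires the trailing ALLOW FILTERING.
    def rangeQuery(r: TokenRange): String =
      s"SELECT * FROM ks.events " +
      s"WHERE token(sensor_id) > ${r.start} AND token(sensor_id) <= ${r.end} " +
      s"AND status = 'ERROR' ALLOW FILTERING"

    // splitRing(8).map(rangeQuery) gives eight scans that together cover the table.

Each executor works through its share of those ranges, so no single query looks huge, but in aggregate you've still read the whole table.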

On Thu, Jul 25, 2019 at 6:49 AM ZAIDI, ASAD A <az1...@att.com> wrote:
Hello Folks,

I was going through the documentation and saw in many places that ALLOW FILTERING causes performance unpredictability. Our developers say the ALLOW FILTERING clause is implicitly added to a bunch of queries by the Spark Cassandra connector and that they cannot control it; at the same time, we do see unpredictability in application performance, just as the documentation says.

I’m trying to understand why a connector would add a clause to a query when it can negatively impact database/application performance. Is it the data model that drives the connector to add ALLOW FILTERING to queries automatically, or are there other reasons this clause gets added? I’m not a developer, but I want to know why the developers don’t have any control over this.

I’d appreciate your guidance here.

Thanks
Asad

