Re: Performance impact with ALLOW FILTERING clause.

2019-08-17 Thread Devopam Mittra
Hi Asad,
Seems to me that your development team will need to remodel the tables
sooner rather than later. This problem can't be left unattended for long
once it starts to hurt.
Given the way Cassandra works, you may want to have them replicate the same
table with a different PK / structure, so that the base query can carry a
suitable WHERE clause, if nothing else works out.

ALLOW FILTERING is best avoided for routine queries. At most it is good for
ad-hoc analysis that does not involve aggregate operations (like count/sum).
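
A minimal sketch of the "same table, different PK" idea, in Python rather
than CQL, with plain dictionaries standing in for Cassandra tables. The table
names (orders_by_customer, orders_by_status), the columns, and the write_order
helper are hypothetical and do not come from this thread. Each write goes to
both copies, so each read names a partition key instead of filtering every row.

```python
from collections import defaultdict

# Two in-memory "tables" holding the same rows under different partition keys.
orders_by_customer = defaultdict(list)  # partition key: customer_id
orders_by_status = defaultdict(list)    # partition key: status


def write_order(order_id, customer_id, status):
    """Write one logical row to both query-specific tables."""
    row = {"order_id": order_id, "customer_id": customer_id, "status": status}
    orders_by_customer[customer_id].append(row)
    orders_by_status[status].append(row)


write_order(1, "alice", "shipped")
write_order(2, "bob", "pending")
write_order(3, "alice", "pending")

# Each query names a partition key, so neither needs a filtering scan.
alice_orders = [r["order_id"] for r in orders_by_customer["alice"]]
pending_orders = [r["order_id"] for r in orders_by_status["pending"]]
print(alice_orders)
print(pending_orders)
```

The price is extra writes and storage, and the application has to keep both
copies in step when a row changes.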

Regards
Devopam




Re: Performance impact with ALLOW FILTERING clause.

2019-08-17 Thread Alex Ott
The Spark connector doesn't run "select * from table;". It reads the data by
token ranges (see
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/partitioner/CassandraPartition.scala#L14).
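
To illustrate what "reads by token ranges" means, here is a rough Python
sketch that cuts the Murmur3 token ring into equal slices and builds one range
query per slice. It is not the connector's code: as far as I know the real
connector derives its splits from cluster metadata and size estimates, and the
names used here (split_token_ring, range_query, ks.events, id) are made up for
the example.

```python
MIN_TOKEN = -(2 ** 63)      # smallest Murmur3Partitioner token
MAX_TOKEN = 2 ** 63 - 1     # largest Murmur3Partitioner token


def split_token_ring(num_splits):
    """Split the full token ring into contiguous (start, end] ranges."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(keyspace, table, pk, start, end):
    """Build the CQL text of one range scan over a slice of the ring."""
    return (
        f"SELECT * FROM {keyspace}.{table} "
        f"WHERE token({pk}) > {start} AND token({pk}) <= {end}"
    )


# Assumption: no key hashes to MIN_TOKEN, so (MIN_TOKEN, MAX_TOKEN] covers the data.
ranges = split_token_ring(4)
queries = [range_query("ks", "events", "id", s, e) for s, e in ranges]
for q in queries:
    print(q)
```

Each slice can be read by a separate Spark task, which is why the load shows
up on Cassandra as many range scans instead of one huge query.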
 


Jacques-Henri Berthemet  at "Thu, 25 Jul 2019 14:18:57 +" wrote:
 JB> Hi Asad,

 JB> That’s because of the way Spark works. Essentially, when you execute a
 JB> Spark job, it pulls the full content of the datastore (Cassandra in your
 JB> case) into its RDDs and works with it “in memory”. While Spark uses “data
 JB> locality” to read data from the nodes that have the required data on
 JB> their local disks, it’s still reading all data from the Cassandra tables.
 JB> To do so it sends a ‘select * from Table ALLOW FILTERING’ query to
 JB> Cassandra.

 JB> From Spark you don’t have much control over the initial query that fills
 JB> the RDDs; sometimes you’ll read the whole table even if you only need one
 JB> row.

 JB> Regards,

 JB> Jacques-Henri Berthemet




-- 
With best wishes,
Alex Ott
Solutions Architect EMEA, DataStax
http://datastax.com/




Re: Performance impact with ALLOW FILTERING clause.

2019-07-26 Thread Christian Lorenz
Hi,

did you also consider “taming” your Spark job by reducing its executors?
The job will probably run longer, in exchange for less stress on the
Cassandra cluster.
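
As a back-of-the-envelope sketch of that trade-off: the number of tasks that
can run at once, and so roughly the number of range scans hitting Cassandra at
once, is bounded by executors times cores per executor. The property names
below are standard Spark settings; the values and the two helper functions are
invented for the example, and with dynamic allocation the executor count is
only a starting point.

```python
def max_concurrent_scans(conf):
    """Upper bound on tasks, and so Cassandra range scans, running at once."""
    instances = int(conf["spark.executor.instances"])
    cores = int(conf["spark.executor.cores"])
    task_cpus = int(conf.get("spark.task.cpus", "1"))
    return instances * (cores // task_cpus)


def to_submit_args(conf):
    """Render settings as spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Hypothetical settings before and after "taming" the job.
before = {"spark.executor.instances": "20", "spark.executor.cores": "4"}
after = {"spark.executor.instances": "5", "spark.executor.cores": "2"}

print(max_concurrent_scans(before))
print(max_concurrent_scans(after))
print(to_submit_args(after))
```

With these made-up numbers the bound drops from 80 concurrent tasks to 10, so
the job runs longer but asks less of the cluster at any one moment.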

Regards
Christian


Re: Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread Jon Haddad
If you're thinking about rewriting your data to be more performant when
doing analytics, you might as well go the distance and put it in an
analytics-friendly format like Parquet.  My 2 cents.



RE: Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread ZAIDI, ASAD A
Thank you all for your insights.

When the spark-connector adds ALLOW FILTERING to a query, the query just
‘runs’ regardless of whether it is expensive (a large table) or cheap (a table
with few rows).
In my particular case, nodes are reaching a load of 2 TB per node in a 50-node
cluster. When a bunch of such queries run, they impact server resources.

Since ALLOW FILTERING is an expensive operation, I’m trying to find knobs that
I can turn to mitigate the impact.

What I think, correct me if I am wrong, is that the query design itself is
not optimized for the table design, and that in turn causes the connector to
add ALLOW FILTERING implicitly. I’m not planning to add secondary indexes on
the tables because they have their own overheads. Kindly share if there are
other means we can use to influence the connector not to use ALLOW FILTERING.

Thanks again.
Asad







Re: Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread Jeff Jirsa
"unpredictable" is such a loaded word. It's quite predictable, but it's
often mispredicted by users.

"ALLOW FILTERING" basically tells the database you're going to do a query
that will require scanning a bunch of data to return some subset of it, and
you're not able to provide a WHERE clause that's sufficiently fine-grained
to avoid the scan. It's a loose equivalent of doing a full table scan in
SQL databases: sometimes it's a valid use case, but it's expensive, you're
ignoring all of the indexes, and you're going to do a lot more work.

It's predictable, though - you're probably going to walk over some range of
data. Spark is grabbing all of the data to load into RDDs, and it probably
does it by slicing up the range, doing a bunch of range scans.

It's doing that so it can get ALL of the data and do the filtering /
joining / searching in-memory in Spark, rather than relying on Cassandra to
do the scanning/searching on disk.
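
A toy model of why the cost is predictable, sketched in Python with a
dictionary standing in for a table (partition key mapped to the rows in that
partition). The table contents and the two functions are invented for the
example and say nothing about Cassandra's internals. A keyed read examines only
the rows of one partition, while a filtering read examines every row, however
few of them match.

```python
# Toy table: partition key -> rows in that partition.
table = {
    sensor_id: [{"sensor_id": sensor_id, "reading": sensor_id * 10 + i} for i in range(3)]
    for sensor_id in range(1000)
}


def read_partition(table, key):
    """Keyed read: touches only the rows of one partition."""
    rows = table.get(key, [])
    return rows, len(rows)


def filtering_scan(table, predicate):
    """Filtering read: examines every row and keeps the matches."""
    examined = 0
    matches = []
    for rows in table.values():
        for row in rows:
            examined += 1
            if predicate(row):
                matches.append(row)
    return matches, examined


keyed_rows, keyed_examined = read_partition(table, 42)
scan_rows, scan_examined = filtering_scan(table, lambda r: r["reading"] == 421)
print(len(keyed_rows), keyed_examined)
print(len(scan_rows), scan_examined)
```

The work done by the filtering read grows with the size of the table, not with
the size of the result, which is the part users tend to mispredict.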



Re: Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread Jacques-Henri Berthemet
Hi Asad,

That’s because of the way Spark works. Essentially, when you execute a Spark
job, it pulls the full content of the datastore (Cassandra in your case) into
its RDDs and works with it “in memory”. While Spark uses “data locality” to
read data from the nodes that have the required data on their local disks,
it’s still reading all data from the Cassandra tables. To do so it sends a
‘select * from Table ALLOW FILTERING’ query to Cassandra.

From Spark you don’t have much control over the initial query that fills the
RDDs; sometimes you’ll read the whole table even if you only need one row.

Regards,
Jacques-Henri Berthemet





Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread ZAIDI, ASAD A
Hello Folks,

I was going through the documentation and saw in many places that ALLOW
FILTERING causes performance unpredictability. Our developers say the ALLOW
FILTERING clause is implicitly added to a bunch of queries by the
spark-Cassandra connector and that they cannot control it; at the same time,
we see unpredictability in application performance, just as the documentation
says.

I’m trying to understand why a connector would add a clause to a query when
it can hurt database/application performance. Is it the data model that
drives the connector to add ALLOW FILTERING to the query automatically, or is
there some other reason this clause is added? I’m not a developer, but I want
to know why developers don’t have any control over this.

I’ll appreciate your guidance here.

Thanks
Asad