[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2015-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496579#comment-14496579
 ] 

Piotr Kołaczkowski commented on CASSANDRA-6348:
---

@Alex, maybe a simple solution would be to allow disabling predicate pushdown 
for cases where ALLOW FILTERING would be needed?

 TimeoutException throws if Cql query allows data filtering and index is too 
 big and it can't find the data in base CF after filtering 
 --

 Key: CASSANDRA-6348
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6348
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Reporter: Alex Liu
Assignee: Alex Liu
 Attachments: 6348.txt


 If an index row is too big and filtering can't find a matching CQL row in the 
 base CF, it keeps scanning the index row and retrieving from the base CF until 
 the index row is scanned completely, which may take too long, and the thrift 
 server returns a TimeoutException. This is one of the reasons why we shouldn't 
 index a column if the index is too big.
 Merging multiple indexes can resolve the case where there are only EQUAL 
 clauses (CASSANDRA-6048 addresses it).
 If the query has non-EQUAL clauses, we still need to do data filtering, which 
 might lead to a timeout exception.
 We can either disable those kinds of queries or WARN the user that data 
 filtering might lead to a timeout exception or OOM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2015-04-15 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496805#comment-14496805
 ] 

Alex Liu commented on CASSANDRA-6348:
-

Hive has a setting to enable pushdown; it is disabled by default. Users can 
enable it if the table has only one indexed column. 




[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2014-01-06 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863159#comment-13863159
 ] 

Alex Liu commented on CASSANDRA-6348:
-

Adding @bcoverston. This issue hits customers hard if Hadoop uses multiple 
indexes.



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-22 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830401#comment-13830401
 ] 

Alex Liu commented on CASSANDRA-6348:
-

rowsPerQuery is only used as the page size for the index CF during a 2i search.

maxColumns is the value of the LIMIT clause. If meanColumns is a big number, 
then filter.maxColumns()/meanColumns is less than 1 and rowsPerQuery is 2. The 
resulting page size for the index CF is 2, which is too small: we end up with 
too many random seeks between the index CF and the base CF, and that's why a 2i 
index search is sometimes so slow. We need to keep the index CF page size from 
being too small. The goal is to set the page size to a number that is large 
enough, but not so large that it causes an OOM, so that we have fewer random 
seeks between the index CF and the base CF.

If data filtering is involved and many base CF columns don't match the filter, 
the small page size makes the issue even worse, because we need to page through 
more pages in the index CF.

{code}
public int maxRows()
{
    return countCQL3Rows ? Integer.MAX_VALUE : maxResults;
}

public int maxColumns()
{
    return countCQL3Rows ? maxResults : Integer.MAX_VALUE;
}
{code}

For a non-CQL query:
{code}
rowsPerQuery = Math.max(Math.min(filter.maxResults, Integer.MAX_VALUE / meanColumns), 2);
// most likely becomes: rowsPerQuery = Math.max(filter.maxResults, 2);
// most likely becomes: rowsPerQuery = filter.maxResults,
// which is the same as the number of rows to fetch
{code}

For a CQL query:
{code}
rowsPerQuery = Math.max(Math.min(Integer.MAX_VALUE, filter.maxResults / meanColumns), 2);
// most likely becomes: rowsPerQuery = Math.max(filter.maxResults / meanColumns, 2);
// most likely becomes: rowsPerQuery = filter.maxResults / meanColumns
// If meanColumns is too big, the quotient can be less than 1 (0 in integer
// division), so rowsPerQuery falls back to 2.
// If the CQL query has no LIMIT clause, it becomes Integer.MAX_VALUE / meanColumns,
// which is a big number.
{code}

So the question is how to calculate the page size for the index CF so that we 
don't have too many random seeks between the index CF and the base CF, while 
also avoiding fetching so many index columns that we hit an OOM.
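
A minimal sketch of the arithmetic described above, written as a standalone class rather than the actual Cassandra source. The class name `IndexPageSizeSketch` and the value chosen for `meanColumns` are assumptions made for illustration; only the `maxRows`/`maxColumns` logic and the `Math.max(Math.min(...), 2)` shape come from the comment itself. A missing LIMIT is modelled as `maxResults = Integer.MAX_VALUE`.

```java
// Sketch only: mirrors the rowsPerQuery arithmetic quoted in this comment.
// It is not the Cassandra implementation; names and values are illustrative.
final class IndexPageSizeSketch {
    static int maxRows(boolean countCQL3Rows, int maxResults) {
        return countCQL3Rows ? Integer.MAX_VALUE : maxResults;
    }

    static int maxColumns(boolean countCQL3Rows, int maxResults) {
        return countCQL3Rows ? maxResults : Integer.MAX_VALUE;
    }

    // meanColumns: estimated mean number of columns per index row (must be > 0).
    static int rowsPerQuery(boolean countCQL3Rows, int maxResults, int meanColumns) {
        // Integer division: a small LIMIT divided by a large meanColumns is 0.
        int rowsByColumnBudget = maxColumns(countCQL3Rows, maxResults) / meanColumns;
        return Math.max(Math.min(maxRows(countCQL3Rows, maxResults), rowsByColumnBudget), 2);
    }

    public static void main(String[] args) {
        int meanColumns = 55535; // hypothetical estimate, chosen for illustration
        System.out.println("CQL, LIMIT 1:     " + rowsPerQuery(true, 1, meanColumns));
        System.out.println("CQL, no LIMIT:    " + rowsPerQuery(true, Integer.MAX_VALUE, meanColumns));
        System.out.println("non-CQL, max 100: " + rowsPerQuery(false, 100, meanColumns));
    }
}
```

With this hypothetical `meanColumns` of 55535, the CQL case yields 2 for LIMIT 1 and 38669 without a limit, which happens to match the two page sizes reported elsewhere in this thread; the real estimate used in that test is not stated.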





[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-22 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13830415#comment-13830415
 ] 

Alex Liu commented on CASSANDRA-6348:
-

If there is data filtering, then for a CQL query the total number of index 
columns needed is unknown and is not directly related to the LIMIT clause, so 
we can't calculate it from the LIMIT clause. Setting it to a magic number that 
is large enough, but not too large, is a viable solution.
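
One way to picture the "magic number" idea is the sketch below. It is an illustration under stated assumptions, not the attached patch: the class name, the method `indexPageSize`, and the constant `FILTERING_PAGE_SIZE` (with its value) are all hypothetical.

```java
// Sketch only: picks the index CF page size, ignoring the LIMIT-based
// estimate whenever data filtering is involved. All names are hypothetical.
final class FilteringPageSizeSketch {
    // Hypothetical "magic number": large enough to cut random seeks,
    // small enough to keep memory use bounded.
    static final int FILTERING_PAGE_SIZE = 10_000;

    static int indexPageSize(int limitBasedEstimate, boolean dataFiltering) {
        if (dataFiltering) {
            // The number of index columns needed is unknown, so the
            // LIMIT-based estimate is not used at all.
            return FILTERING_PAGE_SIZE;
        }
        // Without filtering, keep the existing floor of 2.
        return Math.max(limitBasedEstimate, 2);
    }

    public static void main(String[] args) {
        System.out.println(indexPageSize(0, true));
        System.out.println(indexPageSize(0, false));
        System.out.println(indexPageSize(500, false));
    }
}
```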



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-21 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829445#comment-13829445
 ] 

Alex Liu commented on CASSANDRA-6348:
-

Interestingly, if a LIMIT clause is in the query it times out; otherwise it's 
fine.

{code}
cqlsh:cql3ks> select key, qty, size from cf where qty > 498 and color='red' and size = 'P' allow filtering;

 key        | qty | size
------------+-----+------
 key_910500 | 499 |    P
  key_35500 | 499 |    P
 key_945500 | 499 |    P
 key_420500 | 499 |    P
 key_140500 | 499 |    P
 key_630500 | 499 |    P
 key_210500 | 499 |    P
 key_805500 | 499 |    P
 key_700500 | 499 |    P
 key_735500 | 499 |    P
 key_385500 | 499 |    P
 key_175500 | 499 |    P
 key_455500 | 499 |    P
 key_245500 | 499 |    P
 key_770500 | 499 |    P
 key_875500 | 499 |    P
  key_70500 | 499 |    P
 key_980500 | 499 |    P
 key_280500 | 499 |    P
 key_105500 | 499 |    P
 key_525500 | 499 |    P
 key_665500 | 499 |    P
 key_595500 | 499 |    P
 key_315500 | 499 |    P
 key_490500 | 499 |    P
 key_350500 | 499 |    P
    key_500 | 499 |    P
 key_840500 | 499 |    P
 key_560500 | 499 |    P

cqlsh:cql3ks> select key, qty, size from cf where qty > 498 and color='red' and size = 'P' limit 1 allow filtering;
Request did not complete within rpc_timeout.
{code}



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-21 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829534#comment-13829534
 ] 

Alex Liu commented on CASSANDRA-6348:
-

rowsPerQuery is 2 for the query with LIMIT 1, but it's 38669 for the query 
without a limit.



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-19 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826335#comment-13826335
 ] 

Sylvain Lebresne commented on CASSANDRA-6348:
-

bq. Other than hadoop queries, It's common for user to query on multiple indexes

I sure hope you're wrong, and for sure it shouldn't be, because Cassandra sucks 
at it. And I personally have almost never seen anyone use it (on the mailing 
list for instance). 

ALLOW FILTERING is really meant as a "don't do this unless you're just having 
fun with cqlsh on a toy database". Using ALLOW FILTERING on real production 
queries is wrong (at least for CQL queries; I'm not talking about Hadoop, which 
is a different problem). I'm more than happy to make the documentation/message 
clearer about that fact if it's not.

bq. Hadoop Cql query uses ALLOW FILTERING

Which is kind of a problem, in the sense that it's not what ALLOW FILTERING was 
intended for and that, more generally, CQL has never been designed with Hadoop 
in mind; it's a strictly real-time oriented language. So maybe we should 
re-purpose ALLOW FILTERING as the "hadoop mode" somehow, but if we do, we 
should be explicit about it and think about how to do that best. But trying 
to shove Hadoop into something it hasn't been made for feels wrong to me.

That being said, I wonder if the "Hadoop wants to use the 2ndary indexes" 
problem couldn't be better solved, and more simply overall, by letting it query 
the 2ndary index CFS directly. That is, allow selects on the index itself 
(which would obviously require a special flag to unlock). That way, Hadoop 
would get paging over the index for free (which at the end of the day is the 
problem that needs solving, if I understand it correctly) and would get control 
over that paging. And it would allow Hadoop to do things like merging indexes, 
which probably makes more sense on the Hadoop side than it does on the realtime 
side (i.e. we keep Cassandra focused on realtime queries with as little 
processing as possible, which is what it is good at).
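
As a rough illustration of what "merging indexes" on the client side could look like, the sketch below intersects two index listings with a merge join. It assumes each index, queried directly, returns its matching row keys as a duplicate-free list and that both lists are sorted in the same order; the class and method names are hypothetical, and plain string order stands in for whatever order the index CFs would really use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: client-side intersection of two index listings.
// Assumes both lists are sorted ascending in the same order, without duplicates.
final class IndexMergeSketch {
    static List<String> intersectSorted(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) {
                // Key is present in both indexes: it satisfies both EQUAL clauses.
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++; // advance the side with the smaller key
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> colorIsRed = Arrays.asList("key_1", "key_3", "key_5");
        List<String> sizeIsP = Arrays.asList("key_2", "key_3", "key_5", "key_7");
        System.out.println(intersectSorted(colorIsRed, sizeIsP));
    }
}
```

Each listing is read once and no base CF row is fetched until a key appears in both, which is the property that makes merging attractive for EQUAL-only clauses.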




[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-19 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826809#comment-13826809
 ] 

Alex Liu commented on CASSANDRA-6348:
-

C* is not alone; PostgreSQL has a similar issue with index filter predicates:
http://use-the-index-luke.com/sql/explain-plan/postgresql/filter-predicates 

{code}
Note

Index filter predicates give a false sense of safety; even though an index is 
used, the performance degrades rapidly on a growing data volume or system load.
{code}





[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-19 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826860#comment-13826860
 ] 

Alex Liu commented on CASSANDRA-6348:
-

One solution is to implement a query execution planner that can optimize the 
execution path, either to avoid bad performance or to find the best 
performance. For data filtering, if the index size is above a certain threshold 
we disable index scanning and do a full table scan or index merging instead.
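
The threshold rule could be sketched as below. This is only an illustration of the decision, not a planner: the `Plan` enum, the `choose` method, the constant `MAX_INDEX_COLUMNS_FOR_SCAN` and its value are hypothetical, and the index-merging alternative is left out for brevity.

```java
// Sketch only: a single threshold check standing in for a query planner.
// All names and the threshold value are hypothetical.
final class FilteringPlanSketch {
    enum Plan { INDEX_SCAN, FULL_TABLE_SCAN }

    // Hypothetical cut-off for the estimated number of columns in the index row.
    static final long MAX_INDEX_COLUMNS_FOR_SCAN = 100_000L;

    static Plan choose(long estimatedIndexColumns, boolean dataFiltering) {
        if (dataFiltering && estimatedIndexColumns > MAX_INDEX_COLUMNS_FOR_SCAN) {
            // Index row is too big to walk while filtering: skip the index.
            return Plan.FULL_TABLE_SCAN;
        }
        return Plan.INDEX_SCAN;
    }

    public static void main(String[] args) {
        System.out.println(choose(5_000_000L, true));
        System.out.println(choose(5_000_000L, false));
        System.out.println(choose(1_000L, true));
    }
}
```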



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-18 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825401#comment-13825401
 ] 

Sylvain Lebresne commented on CASSANDRA-6348:
-

Hum, can't really reproduce on the cassandra-1.2 branch:
{noformat}
Connected to test at 127.0.0.1:9160.
[cqlsh 3.1.8 | Cassandra 1.2.11-SNAPSHOT | CQL spec 3.0.0 | Thrift protocol 
19.36.1]
Use HELP for help.
cqlsh> create KEYSPACE ks WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': 1};
cqlsh> use ks;
cqlsh:ks> create table test ( key1 int, key2 int , col1 int, col2 int, 
primary key (key1, key2));
cqlsh:ks> create index col1 on test(col1);
cqlsh:ks> create index col2 on test(col2);
cqlsh:ks> select * from test where col1=100 and col2 =1;
Bad Request: Cannot execute this query as it might involve data filtering and 
thus may have unpredictable performance. If you want to execute this query 
despite the performance unpredictability, use ALLOW FILTERING
{noformat}
I.e. ALLOW FILTERING is indeed required.

bq. We can either disable those kind of queries or WARN the user that data 
filtering might lead to timeout exception or OOM.

Just to make sure we agree, that's *exactly* what requiring ALLOW FILTERING is 
about: warning the user that C* does not execute the query smartly and that the 
performance will suck. You should *never* use ALLOW FILTERING in production 
unless you know very well what you're doing.

bq. We should be able to auto page through 2i CF (for native protocol), so if 
the auto-paging ends in the middle of a index scanning

This is not really what native protocol paging is about. If you ask for pages 
of 1000 results, native protocol paging will return pages of 1000 results 
until you're done paging. In this case, the point is that it takes a long time 
to find any results at all, because the way we handle the query is dumb. But 
I'll note that we do page the index scan internally (which is why you can get 
a timeout but, in theory, not an OOM).

Note that I'm not saying we shouldn't improve the way we handle such queries, 
but that's a whole separate issue (CASSANDRA-6048).




[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-18 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13825496#comment-13825496
 ] 

Alex Liu commented on CASSANDRA-6348:
-

I forgot to put ALLOW FILTERING in the clauses. The issue was raised during 
Hadoop performance testing on indexed columns (the test case indexes columns in 
a way that results in an index that is too big). The Hadoop CQL query uses 
ALLOW FILTERING, and users can provide user-defined WHERE clauses which might 
involve data filtering on multiple columns. But the Hadoop user may not fully 
understand how data filtering works under the hood.

Other than Hadoop queries, it's common for users to query on multiple indexes; 
we should explain in more detail when ALLOW FILTERING results in bad 
performance and which cases lead to a timeout exception in the following 
exception message. 

{code}
Cannot execute this query as it might involve data filtering and thus may have 
unpredictable performance. If you want to execute this query despite the 
performance unpredictability, use ALLOW FILTERING
{code}

For most of the cases, ALLOW FILTERING improves performance. We can't assume 
that users fully understand ALLOW FILTERING under the hood. I even spent 
quite some time on CASSANDRA-6048 to understand more about data filtering.





[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-15 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823528#comment-13823528
 ] 

Sylvain Lebresne commented on CASSANDRA-6348:
-

What version is that test case against? Because requiring ALLOW FILTERING is 
definitely the intent of the following code from SelectStatement:
{noformat}
// Make sure this query is allowed (note: only key range can involve filtering underneath)
if (!parameters.allowFiltering && stmt.isKeyRange)
{
    // We will potentially filter data if either:
    //  - Have more than one IndexExpression
    //  - Have no index expression and the column filter is not the identity
    if (stmt.metadataRestrictions.size() > 1 || (stmt.metadataRestrictions.isEmpty() && !stmt.columnFilterIsIdentity()))
        throw new InvalidRequestException("Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. "
                                        + "If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING");
}
{noformat}



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-15 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823819#comment-13823819
 ] 

Alex Liu commented on CASSANDRA-6348:
-

It was tested against the 1.2.11 release. 



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-15 Thread Alex Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823830#comment-13823830
 ] 

Alex Liu commented on CASSANDRA-6348:
-

We should be able to auto-page through the 2i CF (for the native protocol), so 
that if auto-paging ends in the middle of an index scan, the next page starts 
from where the index scan ended in the previous page.
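
A minimal sketch of that resume-from-position idea, using an in-memory list in place of the index row. Everything here is an assumption for illustration: the class `ResumableIndexScanSketch`, the `Page` holder, the `scanBudget` parameter (a cap on index entries examined per call, standing in for a time limit), and the predicate that plays the role of the base CF filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Sketch only: pages through an index listing and reports where to resume.
// The list stands in for the index row; all names are hypothetical.
final class ResumableIndexScanSketch {
    static final class Page {
        final List<String> rows;   // keys that passed the filter
        final int nextPosition;    // index position to resume from, -1 when exhausted

        Page(List<String> rows, int nextPosition) {
            this.rows = rows;
            this.nextPosition = nextPosition;
        }
    }

    static Page nextPage(List<String> indexEntries, int start, int pageSize,
                         int scanBudget, Predicate<String> baseRowMatches) {
        List<String> rows = new ArrayList<>();
        int pos = start;
        int scanned = 0;
        // Stop when the page is full, the budget is spent, or the index ends.
        while (pos < indexEntries.size() && rows.size() < pageSize && scanned < scanBudget) {
            String key = indexEntries.get(pos++);
            scanned++;
            if (baseRowMatches.test(key)) {
                rows.add(key);
            }
        }
        return new Page(rows, pos < indexEntries.size() ? pos : -1);
    }

    public static void main(String[] args) {
        List<String> index = Arrays.asList("a1", "b2", "a3", "b4", "a5");
        int pos = 0;
        while (pos >= 0) {
            Page page = nextPage(index, pos, 2, 3, k -> k.startsWith("a"));
            System.out.println(page.rows + " next=" + page.nextPosition);
            pos = page.nextPosition;
        }
    }
}
```

A page may come back short, or even empty, when the budget runs out before enough rows match; the caller keeps going as long as `nextPosition` is not -1, so no part of the index is scanned twice.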



[jira] [Commented] (CASSANDRA-6348) TimeoutException throws if Cql query allows data filtering and index is too big and it can't find the data in base CF after filtering

2013-11-14 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13823034#comment-13823034
 ] 

Jonathan Ellis commented on CASSANDRA-6348:
---

Guess we need to require ALLOW FILTERING for these.
