On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.sar...@snapdeal.com>
wrote:

> I have a concern about using a secondary index on a field with low
> cardinality. Let's say I have a few billion rows and each row falls into
> one of 1000 categories. Let's say we have a 50-node cluster.
>
> Now we want to fetch data for a single category using a secondary index on
> the category. The query is also paginated, with a fetch size of, say, 5000.
>
> Since a query on a secondary index works as a scatter-gather operation run
> by the coordinator node, would it lead to out-of-memory errors on the
> coordinator, or to too many timeout errors?
>

Paging will prevent the coordinator from using excessive memory.  With the
type of data that you described, timeouts shouldn't be a huge problem because
it will only take a few token ranges (assuming you're using vnodes) to get
enough matching rows to hit the page size.
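
As a rough back-of-the-envelope check of that claim, the sketch below
estimates how many token ranges have to be scanned to fill one page.  The
figures are assumptions for illustration only: 3 billion rows standing in
for "a few billion", rows spread evenly over the categories and over the
token ranges, and 256 vnodes per node.  The function name is hypothetical.

```python
import math


def ranges_to_fill_page(total_rows, categories, nodes, vnodes_per_node, page_size):
    """Estimate how many token ranges must be scanned to fill one page.

    Assumes rows are spread evenly across categories and token ranges,
    which real data will only approximate.
    """
    rows_per_category = total_rows / categories
    token_ranges = nodes * vnodes_per_node
    matches_per_range = rows_per_category / token_ranges
    return math.ceil(page_size / matches_per_range)


# Assumed figures: 3 billion rows, 1000 categories, 50 nodes,
# 256 vnodes per node, page size 5000.
estimate = ranges_to_fill_page(3_000_000_000, 1000, 50, 256, 5000)
print(estimate)
```

With fewer vnodes per node each token range is larger, so even fewer
ranges are needed per page.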


>
> How does pagination (token-level data fetch) behave in the scatter-gather
> approach?
>

Secondary index queries fetch token ranges in sequential order [1],
starting with the minimum token.  When you fetch a new page, it resumes
from the last token (and primary key) that it returned in the previous page.

[1] As an optimization, multiple token ranges will be fetched in parallel
based on estimates of how many token ranges it will take to fill the page.
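
The resume behavior can be pictured with a small toy model.  This is not
Cassandra's implementation: it only mimics scanning token ranges in order
and resuming after the last (token, primary key) pair returned.  It leaves
out the parallel fetching mentioned in [1], and the names fetch_page and
resume_after are hypothetical.

```python
def fetch_page(token_ranges, page_size, resume_after=None):
    """Return (page, paging_state) from a toy sequential range scan.

    token_ranges: list of lists of (token, primary_key) tuples, with the
    ranges and the rows inside them sorted in token order.
    resume_after: the last (token, primary_key) returned by the previous
    page, or None to start from the minimum token.
    """
    page = []
    for rows in token_ranges:          # visit ranges in token order
        for row in rows:
            if resume_after is not None and row <= resume_after:
                continue               # already returned in an earlier page
            page.append(row)
            if len(page) == page_size:
                return page, row       # paging state = last row returned
    return page, None                  # scan exhausted, no further pages


# Matching rows for one category, grouped by token range (made-up data).
ranges = [[(1, "a"), (2, "b")], [(5, "c")], [(9, "d"), (12, "e")]]

page1, state = fetch_page(ranges, 2)
page2, state = fetch_page(ranges, 2, resume_after=state)
page3, state = fetch_page(ranges, 2, resume_after=state)
```

Each call only touches as many ranges as it needs to fill the page, which
is why memory use stays bounded by the page size.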


>
> Secondly, what if we create an inverted table with the category as the
> partition key? This will lead to lots of data on a single node. It might
> then lead to a hot-shard issue, and to performance problems when fetching
> data from a single node, since a single partition has millions of rows.
>
> How should we tackle such low cardinality index in Cassandra?


The data distribution that you described sounds like a reasonable fit for
secondary indexes.  However, I would also take into account how frequently
you run this query and how fast you need it to be.  Even ignoring the
scatter-gather aspects, secondary index queries are still expensive
because they fetch many non-contiguous rows from an SSTable.  If you need
to run this query very frequently, that may add too much load to your
cluster, and some sort of inverted table approach may be more appropriate.
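
One common way to keep an inverted table from concentrating a whole
category on one node is to add a bucket to the partition key.  The sketch
below is only an illustration of that idea, not something specific to this
data set: the bucket count of 64, the use of an MD5 hash of the row id, and
all function names are assumptions.

```python
import hashlib

NUM_BUCKETS = 64  # assumption: tune so each partition stays a manageable size


def bucket_for(row_id, num_buckets=NUM_BUCKETS):
    """Map a row id to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.md5(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition_key(category, row_id, num_buckets=NUM_BUCKETS):
    """Composite partition key (category, bucket) used when writing a row."""
    return (category, bucket_for(row_id, num_buckets))


def partitions_for_category(category, num_buckets=NUM_BUCKETS):
    """All partitions a reader must query to cover one category."""
    return [(category, bucket) for bucket in range(num_buckets)]


# Hypothetical usage: a write goes to one partition, a read of the whole
# category queries each of the buckets in turn.
key = partition_key("books", "row-1")
all_keys = partitions_for_category("books")
```

The trade-off is that reading a full category now takes one query per
bucket, but each query hits a single, bounded partition instead of
scanning the cluster.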

-- 
Tyler Hobbs
DataStax <http://datastax.com/>
