Cassandra is good at two kinds of queries: 1) access a specific row by a
specific key, and 2) access a slice, or consecutive sequence, of rows within
a given partition.
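
As a rough sketch of those two query shapes in CQL (the table and column
names here are illustrative, not from this thread):

```sql
-- Hypothetical table: device_id is the partition key,
-- event_time is the clustering column.

-- 1) Access a specific row by a specific key:
SELECT * FROM events
 WHERE device_id = 'abc123'
   AND event_time = '2015-11-09 10:00:00';

-- 2) Access a consecutive slice of rows within one partition:
SELECT * FROM events
 WHERE device_id = 'abc123'
   AND event_time >= '2015-11-02'
   AND event_time <  '2015-11-09';
```

Both restrict the query to a single partition, which is what Cassandra can
answer efficiently.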

The general recommendation is to avoid ALLOW FILTERING. If it happens to
work well for you, great, go for it, but if it doesn't, don't force it. It
is best to redesign your data model to play to Cassandra's strengths.

If you bucket the time-based table, do a separate query for each time
bucket.
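
A minimal sketch of a bucketed table and the per-bucket queries (the schema
and the text day bucket are illustrative assumptions, not a prescription):

```sql
-- Composite partition key (device_id, bucket) keeps partitions bounded.
CREATE TABLE events_by_device (
  device_id  text,
  bucket     text,       -- e.g. '2015-11-08' for day-sized buckets
  event_time timestamp,
  latitude   double,
  longitude  double,
  PRIMARY KEY ((device_id, bucket), event_time)
);

-- One query per time bucket, issued from the client:
SELECT * FROM events_by_device
 WHERE device_id = 'abc123' AND bucket = '2015-11-08';
SELECT * FROM events_by_device
 WHERE device_id = 'abc123' AND bucket = '2015-11-09';
```

The client merges the per-bucket results; each individual query still hits
only one partition.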

-- Jack Krupansky

On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <
guilla...@databerries.com> wrote:

> Kai, Jack,
>
> On 1., should the bucket be a STRING with a date format or do I have a
> better option ? For (device_id, bucket, timestamp), did you mean
> ((device_id, bucket), timestamp) ?
>
> On 2., what are the risks of timeout ? I currently have this warning:
> "Cannot execute this query as it might involve data filtering and thus may
> have unpredictable performance. If you want to execute this query despite
> the performance unpredictability, use ALLOW FILTERING".
>
> On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <dep...@gmail.com> wrote:
>
>> 1. Don't make your partition unbounded. It's tempting to just use
>> (device_id, timestamp), but sooner or later you will have problems as time
>> goes by. You can keep the partition bounded by using (device_id, bucket,
>> timestamp). Use hour, day, month, or even year, like Jack mentioned,
>> depending on the size of the data.
>>
>> 2. As to your specific query, for a given partition and a time range, C*
>> doesn't need to load the whole partition then filter. It only retrieves the
>> slice within the time range from disk because the data is clustered by
>> timestamp.
>>
>> On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupan...@gmail.com>
>> wrote:
>>
>>> The general rule in Cassandra data modeling is to look at all of your
>>> queries first and then to declare a table for each query, even if that
>>> means storing multiple copies of the data. So, create a second table with
>>> bucketed time as the partition key (hour, 15 minutes, or whatever time
>>> interval makes sense to give 1 to 10 megabytes per partition) and time and
>>> device as the clustering keys.
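>>>
>>> A rough sketch of such a second table (the names and the hour-sized
>>> bucket are illustrative assumptions, not from this thread):
>>>
>>> ```sql
>>> CREATE TABLE events_by_time (
>>>   time_bucket text,       -- e.g. '2015-11-09 10' for hour buckets
>>>   event_time  timestamp,
>>>   device_id   text,
>>>   latitude    double,
>>>   longitude   double,
>>>   PRIMARY KEY ((time_bucket), event_time, device_id)
>>> );
>>> ```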
>>>
>>> Or, consider DSE Search, and then you can do whatever ad hoc queries you
>>> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
>>> plugin.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
>>> guilla...@databerries.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are currently storing geolocation events (about 1 per 5 minutes) for
>>>> each device we track. We currently have 2 TB of data. I would like to store
>>>> the device_id, the timestamp of the event, latitude and longitude. I thought
>>>> about using the device_id as the partition key and timestamp as the
>>>> clustering column. It is great as events are naturally grouped by device
>>>> (very useful for our Spark jobs). However, if I want to retrieve all
>>>> events of all devices from the last week, I understand that Cassandra will
>>>> need to load all the data and filter it, which does not seem clean in the
>>>> long term.
>>>>
>>>> How should I create my model?
>>>>
>>>> Best Regards
>>>>
>>>
>>>
>>
>
