Re: Cassandra filter with ordering query modeling

2018-03-01 Thread Valentina Crisan
1) I created another table for Query#2/3. The partition Key was StartTime
and clustering key was name. When I execute my queries, I get an exception
saying that I need to ALLOW FILTERING.

*Primary key (startTime, name) - the only queries that can be answered by
this model are: where startTime = <value>, where startTime IN (value1,
value2), where startTime = <value> and name = <value>. Clustering keys
support =, <, <=, >, >= while the partition key supports only the = and IN
operators.*
*Your query had name first and then startTime, so Cassandra is telling you
that it cannot answer this unless you add ALLOW FILTERING at the end of the
query - which is basically a disaster for performance, since it will bring
all the data into the coordinator and perform local filtering of the data
there. So the model is not good for this query.*
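A minimal sketch of the restrictions described above (table and column names are assumptions based on the thread, not the original schema):

```sql
-- PRIMARY KEY (startTime, name): startTime is the partition key,
-- name is the clustering key.
CREATE TABLE tableA_by_start (
    startTime timestamp,
    name text,
    state text,
    PRIMARY KEY (startTime, name)
);

-- Supported: partition key restricted by = or IN, optionally
-- followed by a clustering-key restriction.
SELECT * FROM tableA_by_start WHERE startTime = '2018-03-01 12:00:00';
SELECT * FROM tableA_by_start
 WHERE startTime IN ('2018-03-01 12:00:00', '2018-03-01 13:00:00');
SELECT * FROM tableA_by_start
 WHERE startTime = '2018-03-01 12:00:00' AND name = 'test';

-- NOT supported without ALLOW FILTERING: name is a clustering key and
-- cannot be restricted unless the partition key is restricted first.
-- SELECT * FROM tableA_by_start WHERE name = 'test';
```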

2) I created a table with Name as partitioning key and startTime as
clustering key. This way I was able to order the data in descending order
based on startTime. But the problem was that if a row with same "name" was
inserted, it was overriding the previously inserted row.

*In Cassandra the primary key has 2 main purposes: to answer the queries
and to provide uniqueness for the entries. This means that every
combination of (name, startTime) should be unique, otherwise Cassandra will
overwrite the existing values (C* doesn't read before write by default) and
write the new ones. In your case, name in combination with different
startTimes should provide uniqueness for the entries. If it's likely to
have 2 entries for 1 name and 1 startTime, then you need to add another
column to the primary key to provide the uniqueness. This column will be
the last clustering key and you will not need to involve it in queries -
its role is only to guarantee uniqueness.*
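One common way to sketch this is a timeuuid as the trailing clustering key (table and column names here are illustrative assumptions, not from the original schema):

```sql
-- name as partition key, startTime DESC for ordering, and a timeuuid
-- as the last clustering key purely for uniqueness.
CREATE TABLE tableA_by_name (
    name text,
    startTime timestamp,
    entry_id timeuuid,
    state text,
    PRIMARY KEY (name, startTime, entry_id)
) WITH CLUSTERING ORDER BY (startTime DESC, entry_id DESC);

-- Two inserts with the same (name, startTime) no longer overwrite
-- each other, because now() yields a distinct timeuuid each time:
INSERT INTO tableA_by_name (name, startTime, entry_id, state)
VALUES ('test', '2018-03-01 12:00:00', now(), 'running');

-- Query #2 works without ever restricting entry_id:
SELECT * FROM tableA_by_name WHERE name = 'test';
```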


 Valentina


On Thu, Mar 1, 2018 at 3:26 PM, Behroz Sikander  wrote:

> Thank you for your response.
>
> I have been through the document and I have tried these techniques but I
> failed to model my queries correctly.
>
> For example, I have already tried the following:
> 1) I created another table for Query#2/3. The partition Key was StartTime
> and clustering key was name. When I execute my queries, I get an exception
> saying that I need to ALLOW FILTERING.
> 2) I created a table with Name as partitioning key and startTime as
> clustering key. This way I was able to order the data in descending order
> based on startTime. But the problem was that if a row with same "name" was
> inserted, it was overriding the previously inserted row.
>
> I am not sure how to model such queries.
>
>
> On Thu, Mar 1, 2018 at 2:02 PM, Kyrylo Lebediev 
> wrote:
>
>> Hi!
>>
>>
>> Partition key (Id in your case) must be in the WHERE clause if you're not
>> using indexes (and indexes should be used carefully, not like in
>> relational DBs). Also, only columns which belong to the primary key (=
>> partition key + clustering key) can be used in WHERE in such cases.
>> That's why the 2nd and 3rd queries are failing.
>> You might find this useful:
>> http://cassandra.apache.org/doc/latest/cql/dml.html#the-where-clause
>>
>> There are several Cassandra handbooks available on Amazon; maybe it would
>> be helpful for you to use some of them as a starting point to understand
>> aspects of Cassandra data [query] modeling.
>>
>>
>> Regards,
>>
>> Kyrill
>> --
>> *From:* Behroz Sikander 
>> *Sent:* Thursday, March 1, 2018 2:36:28 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Cassandra filter with ordering query modeling
>>
>> Hi,
>>
>>
>> I am new to Cassandra and I am trying to model a table in Cassandra. My
>> queries look like the following
>>
>> Query #1: select * from TableA where Id = "123"
>> Query #2: select * from TableA where name = "test" order by startTime DESC
>> Query #3: select * from TableA where state = "running" order by startTime DESC
>>
>> I have been able to build the table for Query #1 which looks like
>>
>> val tableAStatement = SchemaBuilder.createTable("tableA").ifNotExists.
>>   addPartitionKey(Id, DataType.uuid).
>>   addColumn(Name, DataType.text).
>>   addColumn(StartTime, DataType.timestamp).
>>   addColumn(EndTime, DataType.timestamp).
>>   addColumn(State, DataType.text)
>>
>> session.execute(tableAStatement)
>>
>> but for Query #2 and #3, I have tried many different things but failed.
>> Every time, I get stuck on a different error from Cassandra.
>>
>> Considering the above queries, what would be the right table model? What
>> is the right way to model such queries?
>> Regards,
>> Behroz
>>
>
>


Re: Materialized Views and TTLs

2018-02-23 Thread Valentina Crisan
Hello Akash,

It depends on what you mean by downsampling: for example, if you have time
as a clustering key in your table you can order data DESC inside partitions
and then just do a select with a per-partition limit (the last 10 entries
in each partition, for example).
But if you would like to extract a totally random subset of data from an
existing table (which would correspond more closely to downsampling), then
you need to do this in an analytical layer on top of Cassandra (like
Apache Spark). As far as I know you cannot do this in Cassandra itself.
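The first option can be sketched as follows (Cassandra 3.6+ for PER PARTITION LIMIT; table and column names are illustrative assumptions):

```sql
-- Time as a DESC clustering key, so the newest rows come first
-- inside each partition.
CREATE TABLE readings (
    sensor_id text,
    ts timestamp,
    value double,
    PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);

-- Last 10 entries of every partition in a single query:
SELECT * FROM readings PER PARTITION LIMIT 10;
```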

Valentina

On Fri, Feb 23, 2018 at 9:44 PM, Akash Gangil <akashg1...@gmail.com> wrote:

> Hi Valentina,
>
> In that case, are there any well defined ways on how to do downsampling of
> data in C*?
>
> thanks!
>
>
> On Fri, Feb 23, 2018 at 11:36 AM, Valentina Crisan <
> valentina.cri...@gmail.com> wrote:
>
>> Hello,
>>
>> as far as I know it is not intended for MVs to have a different TTL than
>> the base tables. There was a patch released at some point to disallow
>> setting a TTL on an MV (https://issues.apache.org/jira/browse/CASSANDRA-12868).
>> MVs should inherit the TTL of the base table.
>>
>> Valentina
>>
>> On Fri, Feb 23, 2018 at 6:42 PM, Akash Gangil <akashg1...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I had a couple of questions:
>>>
>>> 1. Can I create a materialized view on a table with a TTL longer than
>>> the base table? For ex: my materialized view TTL is 1 month while my base
>>> table TTL is 1 week.
>>>
>>> 2. In the above scenario, since the data in my base table would be gone
>>> after a week, would it impact data in the materialized view?
>>>
>>> My use case is: I have some time series data, which is stored in the base
>>> table by_minute, and I want to downsample it to by_month. So my base table
>>> stores by_minute data but my materialized view stores by_week data.
>>>
>>> thanks!
>>>
>>> --
>>> Akash
>>>
>>
>>
>
>
> --
> Akash
>


Re: Materialized Views and TTLs

2018-02-23 Thread Valentina Crisan
Hello,

as far as I know it is not intended for MVs to have a different TTL than
the base tables. There was a patch released at some point to disallow
setting a TTL on an MV (https://issues.apache.org/jira/browse/CASSANDRA-12868).
MVs should inherit the TTL of the base table.

Valentina

On Fri, Feb 23, 2018 at 6:42 PM, Akash Gangil  wrote:

> Hi,
>
> I had a couple of questions:
>
> 1. Can I create a materialized view on a table with a TTL longer than the
> base table? For ex: my materialized view TTL is 1 month while my base table
> TTL is 1 week.
>
> 2. In the above scenario, since the data in my base table would be gone
> after a week, would it impact data in the materialized view?
>
> My use case is: I have some time series data, which is stored in the base
> table by_minute, and I want to downsample it to by_month. So my base table
> stores by_minute data but my materialized view stores by_week data.
>
> thanks!
>
> --
> Akash
>


Re: Using materialized view or AllowFiltering which one is better ?

2017-10-09 Thread Valentina Crisan
Not really, my suggested primary key is similar to the one you have in your
proposed MV. The only difference is that with the MV it is Cassandra that
takes care of data synchronization; with manual denormalization you would
need to do it yourself. Example with an MV: if you had username
'andreas1988' and last_seen '2017-09-11 23:58:23' in your base table, and
then this user accesses the service and last_seen is updated to
'2017-10-09 23:58:23' in your base table, what happens in the background is
that for the MV Cassandra will, in a batch operation, delete username
'andreas1988' from the partition '2017-09-11 23:58:23' and add it to the
partition '2017-10-09 23:58:23'. Only when this batch finishes will
Cassandra update the base table.
If you denormalize manually it will be you that needs to create these batch
operations and make the changes yourself, making sure that you only keep
the last value of last_seen in your table. You will obtain the same result
in the end, only the operational effort will be bigger.
I understand why an MV would be good for your requirements, but I have seen
from the discussions that MV is not recommended for production, mainly
because it is not possible to check whether a view is out of sync with the
base table. Check older discussions (one or two weeks ago) on this list for
details regarding MV usage in production.

One other solution could be to work on your users table with a secondary
index on the last_seen field (Cassandra 3.4 onwards; SASI allows operators
like <, > and multiple-column indexing) - clearly better than allow
filtering, but still the whole cluster would be contacted most of the time
for your queries. Maybe combining Cassandra SASI with Spark data locality
could solve this better. But first you could try with SASI and see the
query performance.
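A sketch of the SASI option (the index name is an assumption; SPARSE mode is intended for columns like last_seen where each value maps to few rows, and it supports range queries):

```sql
CREATE CUSTOM INDEX users_last_seen_idx ON users (last_seen)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'SPARSE' };

-- Range queries on the indexed column now work without ALLOW FILTERING,
-- though the coordinator may still fan out to most of the cluster:
SELECT username FROM users WHERE last_seen < 1507500000000;
```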

Valentina


On Mon, Oct 9, 2017 at 7:56 PM, Avi Levi <a...@indeni.com> wrote:

> Thanks Crisan.
> I understand what you're saying. But according to your suggestion I will
> have a record for every entry, while I am interested only in the last
> entry. So the proposed solution is actually keeping much more data than
> needed.
>
> On Oct 9, 2017 8:40 PM, "Valentina Crisan" <valentina.cri...@gmail.com>
> wrote:
>
> Allow filtering is almost never the answer, especially when you want to do
> a full table scan (there might be some cases where the query is limited to
> a partition and allow filtering could be used). And you would like to run
> this query every minute - thus extremely good performance is required.
> Allow filtering basically brings the whole table content into your
> coordinator and performs local filtering of the data before answering your
> query. Performance-wise, such an implementation is not recommended.
>
> For a query running every minute you need to answer it with one partition
> read (according to Cassandra data modeling rules), and that can be done
> with denormalization (manual or via materialized views). As far as I know,
> and also from the discussions on this list, MVs should still be used with
> caution in production environments. Thus, the best option in my opinion is
> manual denormalization of the data: building a table with partition key
> last_seen and clustering key username, and adding/updating data
> accordingly. Furthermore, since last_seen can be any time of day, you
> could consider building partitions per day: partition key = (last_seen,
> day), primary key = ((last_seen, day), username).
>
> Valentina
>
> On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi <a...@indeni.com> wrote:
>
>> Hi
>>
>> I have the following table:
>>
>> CREATE TABLE users (
>> username text,
>> last_seen bigint,
>> PRIMARY KEY (username)
>> );
>>
>> where *last_seen* is basically the writetime. The number of records in
>> the table is approx. 10 million. Insert is pretty much straightforward:
>> insert into users (username, last_seen) VALUES ([username], now)
>>
>> I want to do some processing on users that were not seen for the past
>> XXX (where XXX can be hours/days ...) by querying the last_seen column
>> (this query runs every minute), e.g.:
>>
>> select username from users where last_seen < (now - 1 day).
>>
>> I have two options as I see it:
>>
>> 1. use a materialized view:
>>
>> CREATE MATERIALIZED VIEW users_last_seen AS
>> SELECT last_seen, username
>> FROM users
>> WHERE last_seen IS NOT NULL
>> PRIMARY KEY (last_seen, username);
>>
>>
>> and simply query:
>>
>> select username from users_last_seen where last_seen < (now - 1 day)
>>
>> 2. query the users table:
>>
>> select username from users where last_seen < (now - 1 day) ALLOW
>> FILTERING
>>
>> Which one is more efficient? Any other options?
>>
>> Any help will be greatly appreciated
>>
>> Best
>>
>> Avi
>>
>
>
>


Re: Using materialized view or AllowFiltering which one is better ?

2017-10-09 Thread Valentina Crisan
Allow filtering is almost never the answer, especially when you want to do
a full table scan (there might be some cases where the query is limited to
a partition and allow filtering could be used). And you would like to run
this query every minute - thus extremely good performance is required.
Allow filtering basically brings the whole table content into your
coordinator and performs local filtering of the data before answering your
query. Performance-wise, such an implementation is not recommended.

For a query running every minute you need to answer it with one partition
read (according to Cassandra data modeling rules), and that can be done
with denormalization (manual or via materialized views). As far as I know,
and also from the discussions on this list, MVs should still be used with
caution in production environments. Thus, the best option in my opinion is
manual denormalization of the data: building a table with partition key
last_seen and clustering key username, and adding/updating data
accordingly. Furthermore, since last_seen can be any time of day, you
could consider building partitions per day: partition key = (last_seen,
day), primary key = ((last_seen, day), username).
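The suggested manually denormalized table might be sketched like this (the table name is an assumption; note that with a composite partition key the application has to know, or enumerate, which (last_seen, day) partitions to read, since partition keys only support = and IN):

```sql
-- Day bucketing keeps individual partitions bounded in size.
CREATE TABLE users_by_last_seen (
    last_seen bigint,
    day date,
    username text,
    PRIMARY KEY ((last_seen, day), username)
);
```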

Valentina

On Mon, Oct 9, 2017 at 1:13 PM, Avi Levi  wrote:

> Hi
>
> I have the following table:
>
> CREATE TABLE users (
> username text,
> last_seen bigint,
> PRIMARY KEY (username)
> );
>
> where *last_seen* is basically the writetime. The number of records in
> the table is approx. 10 million. Insert is pretty much straightforward:
> insert into users (username, last_seen) VALUES ([username], now)
>
> I want to do some processing on users that were not seen for the past
> XXX (where XXX can be hours/days ...) by querying the last_seen column
> (this query runs every minute), e.g.:
>
> select username from users where last_seen < (now - 1 day).
>
> I have two options as I see it:
>
> 1. use a materialized view:
>
> CREATE MATERIALIZED VIEW users_last_seen AS
> SELECT last_seen, username
> FROM users
> WHERE last_seen IS NOT NULL
> PRIMARY KEY (last_seen, username);
>
>
> and simply query:
>
> select username from users_last_seen where last_seen < (now - 1 day)
>
> 2. query the users table:
>
> select username from users where last_seen < (now - 1 day) ALLOW
> FILTERING
>
> Which one is more efficient? Any other options?
>
> Any help will be greatly appreciated
>
> Best
>
> Avi
>