Re: What is the best way to model my time series?

2016-03-25 Thread K. Lawson
Sorry Gerard, I'm afraid I'm not familiar with that project.

The time series I've described is a relatively minor component of an
application which is already powered by Cassandra, so you can see why I'd
prefer a viable way (which I'm quickly learning may not exist) to model it
in Cassandra.

On Fri, Mar 25, 2016 at 2:04 PM, Gerard Maas wrote:

> Hi,
>
> It sounds to me like Apache Kafka would be a better fit for your
> requirements. Have you considered that option?
>
> kr, Gerard
> Datastax MVP for Apache Cassandra (so, I'm not suggesting other tech for
> any other reason than seeing it as a better fit)

Re: What is the best way to model my time series?

2016-03-25 Thread Gerard Maas
Hi,

It sounds to me like Apache Kafka would be a better fit for your
requirements. Have you considered that option?

kr, Gerard
Datastax MVP for Apache Cassandra (so, I'm not suggesting other tech for
any other reason than seeing it as a better fit)

On Fri, Mar 25, 2016 at 1:31 PM, K. Lawson wrote:

> While adhering to best practices, I am trying to model a time series in
> Cassandra that is compliant with the following access pattern directives:
>
> - Is to be both read and shrunk by a single party, grown by multiple
> parties
> - Is to be read as a queue (in other words, its entries, from first to
> last, are to be paged through in order)
> - Is to be grown as a queue (in other words, new entries (the number of
> which is expected to fall in the range of 0 to a couple of hundred per
> day) are always APPENDED to the series)
> - Is to be shrunk by way of the removal of any entries which have been
> processed by the application (immediately upon completion of said
> processing)
>
> So far, I've come up with four solutions, listed below (along with their
> pros and cons), that are compliant with the directives given above; is
> there any solution superior to these, and if not, which one of these is
> best?
>
>
>
> Solution #1:
>
>
> //Processing position markers (saved somewhere on disk)
> mostRecentProcessedItemInsertTime = 0
> mostRecentProcessedItemInsertDayStartTime = 0
>
> CREATE TABLE IF NOT EXISTS solution_table_1
> (
>   itemInsertDayStartTime timestamp,
>   itemInsertTime timestamp,
>   itemId timeuuid,
>   PRIMARY KEY (itemInsertDayStartTime, itemInsertTime, itemId)
> );
>
> //Initial row retrieval query (presumably, the position markers will be
> //appropriately updated after each retrieval)
> SELECT *
> FROM solution_table_1
> WHERE itemInsertDayStartTime IN
>   (mostRecentProcessedItemInsertDayStartTime,
>    mostRecentProcessedItemInsertDayStartTime + 86400000, ...)
> AND itemInsertTime > mostRecentProcessedItemInsertTime
> LIMIT 30;
>
> Pros:
> - Shards table data across the cluster
>
> Cons:
> - Requires the maintenance of position markers
> - Requires the explicit specification of partitions (which may or may not
> have data) to target for retrievals which page the table data by
> itemInsertTime
> - Requires correspondence with multiple nodes to satisfy retrievals which
> page the table data by itemInsertTime
>
>
> Solution #2:
>
>
> CREATE TABLE IF NOT EXISTS solution_table_2
> (
>   itemInsertTime timestamp,
>   itemId timeuuid,
>   PRIMARY KEY (itemInsertTime, itemId)
> );
> CREATE INDEX IF NOT EXISTS ON solution_table_2 (itemInsertTime);
>
> //Initial row retrieval query
> SELECT * FROM solution_table_2 WHERE itemInsertTime > 0 LIMIT 30
> ALLOW FILTERING;
>
> Pros:
> - Shards table data across the cluster
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying partitions to target
>
> Cons:
> - Specifies the creation of an index on a high-cardinality column
> - Requires correspondence with multiple nodes, as well as data filtering,
> to satisfy retrievals which page the table data by itemInsertTime
>
>
> Solution #3:
>
>
> CREATE TABLE IF NOT EXISTS solution_table_3
> (
>   itemInsertTime timestamp,
>   itemId timeuuid,
>   itemInsertDayStartTime timestamp,
>   PRIMARY KEY (itemInsertTime, itemId)
> );
> CREATE INDEX IF NOT EXISTS ON solution_table_3 (itemInsertDayStartTime);
>
> //Initial row retrieval query
> SELECT * FROM solution_table_3 WHERE itemInsertDayStartTime > 0 LIMIT 30
> ALLOW FILTERING;
>
> Pros:
> - Shards table data across the cluster
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying partitions to target
> - Specifies the creation of an index on a column that is expected to have
> suitable cardinality
>
> Cons:
> - Requires correspondence with multiple nodes, as well as data filtering,
> to satisfy retrievals which page the table data by itemInsertTime
>
>
> Solution #4:
>
>
> CREATE TABLE IF NOT EXISTS solution_table_4
> (
>   dummyPartitionInt int,
>   itemInsertTime timestamp,
>   itemId timeuuid,
>   PRIMARY KEY (dummyPartitionInt, itemInsertTime, itemId)
> );
>
> //Initial row retrieval query (assuming all rows are inserted with a
> //dummyPartitionInt value of 0)
> SELECT * FROM solution_table_4 WHERE dummyPartitionInt = 0 AND
> itemInsertTime > 0 LIMIT 30;
>
>
> Pros:
> - Enables retrieval to be satisfied with a single replica set
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying more than one partition to target
>
> Cons:
> - Requires the use of a "dummy" column
> - Confines the table data (and, as a result, all operations on it) to a
> single partition
>
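
Solution #1 above needs a list of day partitions to put in the IN clause. A minimal Python sketch of how that list could be computed follows. It assumes the day boundary is midnight UTC and that the markers are epoch milliseconds (the unit a CQL timestamp stores), so one day is a step of 86,400,000. The function and constant names are hypothetical and do not come from the thread.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms in one day


def day_start_ms(insert_time_ms):
    """Round an epoch-millisecond timestamp down to midnight UTC of its day."""
    return insert_time_ms - insert_time_ms % MS_PER_DAY


def buckets_to_query(marker_day_start_ms, now_ms):
    """List every day bucket from the marker's day through the current day."""
    last_bucket = day_start_ms(now_ms)
    return list(range(marker_day_start_ms, last_bucket + 1, MS_PER_DAY))


# Hypothetical marker at the epoch, with "now" a few seconds into day 2.
marker = day_start_ms(0)
now = 2 * MS_PER_DAY + 5_000
print(buckets_to_query(marker, now))  # [0, 86400000, 172800000]
```

Every bucket in the range is listed whether or not it holds data, which is the second "con" given for Solution #1.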


Re: What is the best way to model my time series?

2016-03-25 Thread K. Lawson
Hi Jack, thanks for the interest in my inquiry. Let me see if I can answer
your questions.

1. The growth rate of the time series is expected to be relatively constant
throughout a given day, while processing is expected to be carried out in
bursts, several times a day.

2. I'm not sure what you mean by "per item"; the time series consists of
appends, each featuring a *unique* item, as in
[appendOneConductTime:itemD, appendTwoConductTime:itemA,
appendThreeConductTime:itemZ, ...]. The figures I gave were estimates of
the number of appends carried out in a given day.

3. If "N" is taken to be the number of appends carried out in a given day,
and most, if not all, of the appends carried out in a given day are to be
processed in that day, then the aggregate number of appends and removes in
a day is expected to be anywhere from 1.51N to 2N. Taking an append count
at the high end of the expected range, 500, that leaves us with a ceiling
of around 1000 total appends and removes in a day.

4. Though the time series must be paged through chronologically, processing
of the items in it can be carried out in any order. (I'm aware that if the
processing of items were carried out in FIFO order, avoiding the scanning of
tombstones would be as simple as keeping a positional marker at the head of
the queue.)

5. Only one type of read, that which pages through the time series.
Processing is carried out using the ids of items procured through said
reads.

6. Again, I'm not sure what you mean by "per item" (see #2). For any given
day however, there isn't expected to be a significant backlog of
unprocessed appends from previous days. So, if "N" is taken to be the
number of appends carried out in a given day, the maximum number of
unprocessed appends in that day would most likely not exceed 3N (1N "newly"
unprocessed appends, and 2N previously unprocessed appends spanning several
days beforehand). In concrete numbers, this leaves us with a ceiling of
around 1500 unprocessed appends at any given time.
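
The two ceilings in points 3 and 6 are simple multiples of N. A minimal Python sketch of that arithmetic follows, using the high-end figure of 500 appends per day from point 3; the variable names are hypothetical.

```python
# Back-of-the-envelope check of the ceilings in points 3 and 6.
appends_per_day = 500  # "N", the high-end estimate used in point 3

# Point 3: if every append is also removed on the same day, the day's
# appends plus removes total 2N.
max_appends_and_removes_per_day = 2 * appends_per_day

# Point 6: 1N newly unprocessed appends plus 2N carried over from
# previous days.
max_unprocessed_backlog = 1 * appends_per_day + 2 * appends_per_day

print(max_appends_and_removes_per_day)  # 1000
print(max_unprocessed_backlog)          # 1500
```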

Again, I thank you for taking time out of your day for this.

On Fri, Mar 25, 2016 at 11:40 AM, Jack Krupansky wrote:

> Still trying to get a handle on the magnitude of the problem...
>
> 1. You said that the rate of growth is a max of a few hundred, but no
> mention of the rate of processing (removal).
> 2. Are these numbers per item or for all items? In any case, how many
> items are you anticipating? Ballpark - dozens, hundreds, thousands,
> millions?
> 3. In short, what is the aggregate number of appends and removes per day?
> 4. Clarify whether the order of removal is strictly by time or by a
> combination of time and item.
> 5. Is there a separate "read" access distinct from the read that also
> results in removal at the end of processing?
> 6. Finally, what is the expected per-item and aggregate number of
> unprocessed events that are expected to be resident in the total queue at
> any moment in time? IOW, how wide might the row be for an item?
>
> I concur with the general sentiment that a queue is a clear antipattern
> for Cassandra. But... you can probably get it to work with enough care and
> sufficient provisioning of the cluster.
>
> The big problem is that rapid, large-scale removal from the queue
> generates tons of tombstones that need to be removed.
>
> The DateTieredCompactionStrategy may help as well.
>
> -- Jack Krupansky

Re: What is the best way to model my time series?

2016-03-25 Thread Jack Krupansky
Still trying to get a handle on the magnitude of the problem...

1. You said that the rate of growth is a max of a few hundred, but no
mention of the rate of processing (removal).
2. Are these numbers per item or for all items? In any case, how many items
are you anticipating? Ballpark - dozens, hundreds, thousands, millions?
3. In short, what is the aggregate number of appends and removes per day?
4. Clarify whether the order of removal is strictly by time or by a
combination of time and item.
5. Is there a separate "read" access distinct from the read that also
results in removal at the end of processing?
6. Finally, what is the expected per-item and aggregate number of
unprocessed events that are expected to be resident in the total queue at
any moment in time? IOW, how wide might the row be for an item?

I concur with the general sentiment that a queue is a clear antipattern for
Cassandra. But... you can probably get it to work with enough care and
sufficient provisioning of the cluster.

The big problem is that rapid, large-scale removal from the queue generates
tons of tombstones that need to be removed.
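
To make the tombstone cost concrete, here is a toy Python model. It is not Cassandra's actual read path; it only counts how many entries a read from the head of a single partition has to step over when deleted entries linger as tombstones until compaction purges them. The function name and the figures are hypothetical.

```python
def rows_scanned(partition, limit, start=0):
    """Count entries touched (live rows and tombstones) to return `limit` live rows.

    `partition` is a list of booleans in clustering order: True is a live
    row, False is a tombstone left by a delete that has not been purged yet.
    """
    scanned = 0
    live = 0
    for is_live in partition[start:]:
        if live == limit:
            break
        scanned += 1
        live += is_live
    return scanned


# 1,000 processed (deleted) entries at the head, 100 unprocessed behind them.
queue = [False] * 1000 + [True] * 100

print(rows_scanned(queue, limit=30))              # 1030: every tombstone is read first
print(rows_scanned(queue, limit=30, start=1000))  # 30: a position marker skips them
```

The second call corresponds to keeping a marker at the head of the queue, which only works if items are processed in FIFO order.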

The DateTieredCompactionStrategy may help as well.

-- Jack Krupansky

RE: What is the best way to model my time series?

2016-03-25 Thread SEAN_R_DURITY
I think this one is better…
https://mail-archives.apache.org/mod_mbox/cassandra-user/201603.mbox/%3CCANeMN=-KEiXXgyWLnSYKnhQMWfWmmy3b68PkLsw55CGSm_UmmQ@mail.gmail.com%3E

Sean Durity

From: K. Lawson [mailto:klawso...@gmail.com]
Sent: Friday, March 25, 2016 10:17 AM
To: user@cassandra.apache.org
Subject: Re: What is the best way to model my time series?


Sean, the link you have supplied does not seem to work.

On Fri, Mar 25, 2016 at 9:43 AM, <sean_r_dur...@homedepot.com> wrote:
You might take a look at this previous conversation on queue-type applications 
and Cassandra. Generally this is an anti-pattern for a distributed system like 
Cassandra.
https://mail-archives.apache.org/mod_mbox/cassandra-user/201603.mbox/

Re: What is the best way to model my time series?

2016-03-25 Thread K. Lawson
Sean, the link you have supplied does not seem to work.

On Fri, Mar 25, 2016 at 9:43 AM,  wrote:

> You might take a look at this previous conversation on queue-type
> applications and Cassandra. Generally this is an anti-pattern for a
> distributed system like Cassandra.

RE: What is the best way to model my time series?

2016-03-25 Thread SEAN_R_DURITY
You might take a look at this previous conversation on queue-type applications 
and Cassandra. Generally this is an anti-pattern for a distributed system like 
Cassandra.