Re: What is the best way to model my time series?

Gerard Maas Fri, 25 Mar 2016 11:05:07 -0700

Hi,

It sounds to me like Apache Kafka would be a better fit for your
requirements. Have you considered that option?


kr, Gerard
Datastax MVP for Apache Cassandra (so, I'm not suggesting other tech for
any other reason that seeing it as a better fit)

On Fri, Mar 25, 2016 at 1:31 PM, K. Lawson <klawso...@gmail.com> wrote:

> While adhering to best practices, I am trying to model a time series in
> Cassandra that is compliant with the following access pattern directives:
>
> - Is to be both read and shrank by a single party, grown by multiple
> parties
> - Is to be read as a queue (in other words, its entries, from first to
> last, are to be paged through in order)
> - Is to grown as a queue (in other words, new entries (the number of which
> is expected to fall in the range of 0 to a couple of hundred per day) are
> always APPENDED to the series)
> - Is to be shrunk by way of the removal of any entries which have been
> processed by the application (immediately upon completion of said
> processing)
>
> So far, I've come up with four solutions, listed below (along with their
> pros and cons), that are compliant with
> the directives given above; is there any solution superior to these, and
> if not, which one of these is most optimal?
>
>
>
> Solution #1:
>
>
> //Processing position markers (saved somewhere on disk)
> mostRecentProcessedItemInsertTime = 0
> mostRecentProcessedItemInsertDayStartTime = 0
>
> CREATE TABLE IF NOT EXISTS solution_table_1
> (
> itemInsertDayStartTime timestamp
> itemInsertTime timestamp
> itemId timeuuid
> PRIMARY KEY (itemInsertDayStartTime, itemInsertTime, itemId)
> );
> //Initial row retrieval query (presumably, the position markers will be
> appropriately updated after each retrieval)
>
> SELECT *
>
> FROM solution_table_1
>
> WHERE itemInsertDayStartTime IN
> (mostRecentProcessedItemInsertDayStartTime,
> mostRecentProcessedItemInsertDayStartTime + 86400000, ...)
>
> AND itemInsertTime > mostRecentProcessedItemInsertTime
>
> LIMIT 30
>
> Pros:
> - Shards table data across the cluster
>
> Cons:
> - Requires the maintenance of position markers
> - Requires the explicit specification of partitions (which may or may not
> have data) to target for retrievals which page the table data by
> itemInsertTime
> - Requires correspondence with multiple nodes to satisfy retrievals which
> page the table data by itemInsertTime
>
>
> Solution #2:
>
>
> CREATE TABLE IF NOT EXISTS solution_table_2
> (
>   itemInsertTime timestamp
> itemId timeuuid
> PRIMARY KEY (itemInserTime, itemId)
> );
> CREATE INDEX IF NOT EXISTS ON solution_table_2 (itemInsertTime);
>
> //Initial row retrieval query
> SELECT * FROM solution_table_2 WHERE itemInsertTime > 0 LIMIT 30 ALLOW
> FILTERING
>
> Pros:
> - Shards table data across the cluster
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying partitions to target
>
> Cons:
> - Specifies the creation of an index on a high-cardinality column
> - Requires correspondence with multiple nodes, as well as data filtering,
> to satisfy retrievals which page the table data by itemInsertTime
> Solution #3:
>
> CREATE TABLE IF NOT EXISTS solution_table_3
> (
> itemInsertTime timestamp
> itemId timeuuid
> itemInsertDayStartTime timestamp
> PRIMARY KEY (itemInsertTime, itemId)
> );
> CREATE INDEX IF NOT EXISTS ON solution_table_3 (itemInsertDayStartTime);
> //Initial row retrieval query
> SELECT * FROM solution_table_3 WHERE itemInsertDayStartTime > 0 LIMIT 30
> ALLOW FILTERING
>
> Pros:
> - Shards table data across the cluster
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying partitions to target
> - Specifies the creation of an index on a column with anticipatively
> suitable cardinality
>
> Cons:
> - Requires correspondence with multiple nodes, as well as data filtering,
> to satisfy retrievals which page the table data by itemInsertTime
> Solution #4:
>
> CREATE TABLE IF NOT EXISTS solution_table_4
> (
> dummyPartitionInt int
> itemInsertTime timestamp
> itemId timeuuid
> PRIMARY KEY (dummyPartitionInt, itemInsertTime, itemId)
> );
> //Initial row retrieval query (assuming all rows are inserted with a
> dummyPartitionInt value of 0)
> SELECT * FROM solution_table_4 WHERE dummyPartitionInt = 0 AND
> itemInsertTime > 0 LIMIT 30
>
>
> Pros:
> - Enables retrieval to be satisfied with a single replica set
> - Enables retrievals which page table data by itemInsertTime to be
> conducted without explicitly specifying more than one partition to target
>
> Cons:
> - Requires the use of a "dummy" column
> - Specifies the constriction of table data (and as a result, all
> operations on it) to a single partition
>

Re: What is the best way to model my time series?

Reply via email to