My advice was to use the date on which the reading was recorded as part of
the partition key instead of some arbitrary shard id. Then you don't have to
look anything up in a different table.


create table sensorReadings (
sensorUnitId int,
sensorId int,
date_recorded date,
time timestamp,
timeShard int,
readings blob,
primary key ((sensorUnitId, sensorId, date_recorded), time));
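
With the date in the partition key, reading a single day is a plain
single-partition query. A quick sketch (the literal values here are made up):

select readings from sensorReadings
where sensorUnitId = 1 and sensorId = 2 and date_recorded = '2016-03-10';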


On Thu, Mar 10, 2016 at 9:29 AM Jason Kania <jason.ka...@ymail.com> wrote:

> Hi Jonathan,
>
> Thanks for the response. To make this clearer, this is what we are doing:
>
> create table sensorReadings (
> sensorUnitId int,
> sensorId int,
> time timestamp,
> timeShard int,
> readings blob,
> primary key ((sensorUnitId, sensorId, timeShard), time));
>
> where timeShard is a combination of year and week of year
>
> This works exactly as you mentioned when we know what time range we are
> querying.
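>
> For instance, with the shard encoded as something like year * 100 + week
> of year (just an illustrative encoding), a query against a known shard
> looks like this, with made-up values:
>
> select readings from sensorReadings
> where sensorUnitId = 1 and sensorId = 2 and timeShard = 201610
> and time >= '2016-03-07' and time < '2016-03-14';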
>
> The problem is that for those cases where we want to run through all the
> readings for all timestamps, we don't know the first and last timeShard
> value to use to constrain the query or iterate over each shard. Our
> understanding is that updating another table with the maximum or minimum
> timeShard values on every write to the above table would mean pounding a
> single row with updates, and that running SELECT DISTINCT pulls all
> partition keys.
>
> Hopefully this is clearer.
>
> Again, any suggestions would be appreciated.
>
> Thanks,
>
> Jason
>
> ------------------------------
> *From:* Jonathan Haddad <j...@jonhaddad.com>
> *To:* user@cassandra.apache.org; Jason Kania <jason.ka...@ymail.com>
> *Sent:* Thursday, March 10, 2016 11:21 AM
> *Subject:* Re: Strategy for dividing wide rows beyond just adding to the
> partition key
>
> Have you considered making the date (or week, or whatever, some time
> component) part of your partition key?
>
> something like:
>
> create table sensordata (
> sensor_id int,
> day date,
> ts timestamp,
> reading int,
> primary key ((sensor_id, day), ts));
>
> Then if you know you need data by a particular date range, just issue
> multiple async queries for each day you need.
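>
> A sketch of the fan-out, with hypothetical values (the driver would run
> these concurrently, e.g. via executeAsync or your driver's equivalent):
>
> select reading from sensordata where sensor_id = 1 and day = '2016-03-08';
> select reading from sensordata where sensor_id = 1 and day = '2016-03-09';
> select reading from sensordata where sensor_id = 1 and day = '2016-03-10';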
>
> On Thu, Mar 10, 2016 at 5:57 AM Jason Kania <jason.ka...@ymail.com> wrote:
>
> Hi,
>
> We have sensor input that creates very wide rows, and operations on these
> rows have started to time out regularly. We have been trying to find a
> solution to dividing wide rows but keep hitting limitations that move the
> problem around instead of solving it.
>
> We have a partition key consisting of a sensorUnitId and a sensorId and
> use a time field to access each column in the row. We tried adding a
> time-based entry, timeShardId, to the partition key that consists of the year
> and week of year during which the reading was taken. This works for a
> number of queries but for scanning all the readings against a particular
> sensorUnitId and sensorId combination, we seem to be stuck.
>
> We won't know the range of valid values of the timeShardId for a given
> sensorUnitId and sensorId combination, so we would have to write to an
> additional table to track the valid timeShardId. We suspect this would
> create tombstone accumulation problems given the number of updates required
> to the same row, so we haven't tried this option.
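>
> For reference, the tracking table we considered would look something like
> this (the table and column names are just illustrative):
>
> create table sensorShardBounds (
> sensorUnitId int,
> sensorId int,
> minTimeShard int,
> maxTimeShard int,
> primary key ((sensorUnitId, sensorId))
> );
>
> Every write to sensorReadings would also have to update the one row for
> that sensor here, which is the hot-row pattern we want to avoid.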
>
> Alternatively, we hit a different bottleneck in the form of SELECT
> DISTINCT in trying to directly access the partition keys. Since SELECT
> DISTINCT does not allow for a where clause to filter on the partition key
> values, we have to filter several hundred thousand partition keys just to
> find those related to the relevant sensorUnitId and sensorId. This problem
> will only grow worse for us.
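>
> Concretely, the only query shape available to us is an unfiltered scan of
> every partition key:
>
> select distinct sensorUnitId, sensorId, timeShard from sensorReadings;
>
> after which we have to discard, client side, every key that does not match
> the sensorUnitId and sensorId of interest.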
>
> Are there any other approaches that can be suggested? We have been looking
> around, but haven't found any references beyond the initial suggestion to
> add some sort of shard id to the partition key to handle wide rows.
>
> Thanks,
>
> Jason
>
