If all of your queries are like this (I mean getting all devices given a time
range), Hadoop would be more appropriate, since those are analytical queries.

Anyway, to query such data with the Spark Cassandra connector, your partition
key could include the day plus a hash of your deviceId as a pseudo partition
key column (it could be abs(murmur(deviceId) % 500); we add this column to
distribute the data more evenly). When you want to query a time range, you
should generate an RDD of Tuple2 containing all days that intersect that date
range, and for each day the RDD should include the range 0..499. Like:

(20170726,0)
(20170726,1)
.
.
.
(20170726,499)

Then you should join this RDD with your table using the
joinWithCassandraTable method.
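A minimal Scala sketch of the two steps above (enumerating the (day, bucket)
keys and joining them against the table). The keyspace/table names, the
rewritten table layout in the comment, and the example day values are all
placeholders, not part of the original thread; the bucket count must match
whatever was used when writing the data:

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Assumed rewritten table (names are hypothetical):
//   CREATE TABLE my_ks.data_by_day (
//       day int, bucket int, deviceid int, time timestamp,
//       field1 text, field2 text, field3 text,
//       PRIMARY KEY ((day, bucket), deviceid, time));

object RangeQuery {
  // Must match the abs(murmur(deviceId) % 500) bucketing used at write time.
  val NumBuckets = 500

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("range-query"))

    // Days (as yyyyMMdd ints) that intersect the requested time range.
    val days = Seq(20170726, 20170727)

    // One Tuple2 per (day, bucket) pair -- one per partition to read.
    val keys = sc.parallelize(
      for (day <- days; bucket <- 0 until NumBuckets) yield (day, bucket)
    )

    // Reads only the listed partitions instead of scanning the whole table.
    val rows = keys.joinWithCassandraTable("my_ks", "data_by_day")
    rows.map { case (_, row) => row.getInt("deviceid") }.distinct().collect()
  }
}
```

The tuple elements are matched positionally against the partition key columns
(day, bucket), so each Spark task issues targeted single-partition reads
rather than a full table scan.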

On Jul 26, 2017 4:41 PM, "Junaid Nasir" <jna...@an10.io> wrote:

All devices.
After selecting the data I group it and perform other actions, e.g. sum and
avg on fields, and then display those to compare how devices are doing
relative to each other.

On Wed, Jul 26, 2017 at 5:32 PM, CPC <acha...@gmail.com> wrote:

> Hi Junaid,
>
> Given a time range do you want to take all devices or a specific device?
>
>
> On Jul 26, 2017 3:15 PM, "Junaid Nasir" <jna...@an10.io> wrote:
>
> I have a C* cluster (3 nodes) with about 60 GB of data (replication factor 2).
> When I started using C*, coming from a SQL background, I didn't give much
> thought to modeling the data correctly, so what I did was
>
> CREATE TABLE data ( deviceId int,
>                     time timestamp,
>                     field1 text,
>                     field2 text,
>                     field3 text,
>                     PRIMARY KEY(deviceId, time)) WITH CLUSTERING ORDER BY 
> (time ASC);
>
> but most of the queries I run (using spark and datastax connector)
> compares data of different devices for some time period. for example
>
> SELECT * FROM data WHERE time > '2017-07-01 12:00:00';
>
> From my understanding this runs a full table scan, as shown in the Spark UI
> (DAG visualization: "Scan org.apache.spark.sql.cassandra
> .CassandraSourceRelation@32bb7d65"), meaning C* will read all the data and
> then filter on time. Spark jobs run for hours even for smaller time
> frames.
>
> What is the right approach to data modeling for such queries? I want to
> get a general idea of things to look for when modeling such data.
> I really appreciate all the help from this community :). If you need any
> extra details, please ask me here.
>
> Regards,
> Junaid
>
>
>
