Re: Data model suggestions

Ali Akhtar Mon, 27 Apr 2015 04:54:36 -0700

Wouldn't truncating the table create tombstones?

On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded <oded.p...@rsa.com> wrote:


>  I recommend truncating the table instead of dropping it since you don’t
> need to re-issue DDL commands and put load on the system keyspace.
>
> Both DROP and TRUNCATE automatically create snapshots, so there is no
> “snapshotting” advantage to using DROP. See
> http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Sunday, April 26, 2015 10:31 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Data model suggestions
>
>
>
> Thanks Peer. I like the approach you're suggesting.
>
>
>
> Why do you recommend truncating the last active table rather than just
> dropping it? Since all the data would be inserted into a new table, seems
> like it would make sense to drop the last table, and that way truncate
> snapshotting also won't have to be dealt with (unless I'm missing anything).
>
>
>
> Thanks.
>
>
>
>
>
> On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
> I would maintain two tables.
>
> An “archive” table that holds all the active and inactive records, and is
> updated hourly (re-inserting the same record has some compaction overhead,
> but on the other hand deleting records has tombstone overhead).
>
> An “active” table which holds all the records returned by the last external
> API invocation.
>
> To avoid tombstones and read-before-delete issues, “active” should actually
> be a synonym, an alias, for the most recent active table.
>
> I suggest you create two identical tables, “active1” and “active2”, and an
> “active_alias” table that informs which of the two is the most recent.
>
> Thus when you query the external API you insert the data into “archive” and
> into the unaliased “activeN” table, switch the alias value in “active_alias”,
> and truncate the newly unaliased “activeM” table.
>
> No need to query the data before inserting it. Make sure truncating
> doesn’t create automatic snapshots.
>
>
>
>
>
> *From:* Narendra Sharma [mailto:narendra.sha...@gmail.com]
> *Sent:* Friday, April 24, 2015 6:53 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Data model suggestions
>
>
>
> I think one table, say “record”, should be good. The primary key is the
> record id. This will ensure good distribution.
> Just update the active attribute to true or false.
> For range queries on active vs. archived records, maintain two indexes or
> try a secondary index.
>
> On Apr 23, 2015 1:32 PM, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>
> Good point about the range selects. I think they can be made to work with
> limits, though. Or, since the active records will usually never exceed 500k,
> the ids may just be cached in memory.
>
>
>
> Most of the time, during reads, the queries will just consist of select *
> where primaryKey = someValue. One row at a time.
>
>
>
> The question is just whether to keep all records in one table (including
> archived records which won't be queried 99% of the time), or to keep active
> records in their own table, and delete them when they're no longer active.
> Will that produce tombstone issues?
>
>
>
> On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <khangaon...@gmail.com>
> wrote:
>
> Hi,
>
> If your external API returns active records, I am guessing you
> need to do a select * on the active table to figure out which records in
> the table are no longer active.
>
> You might be aware that range selects based on the partition key will time
> out in Cassandra. They can, however, be made to work using the clustering
> key.
>
> To comment more, we would need to see your proposed Cassandra tables and
> the queries that you might need to run.
>
> regards
>
>
>
>
>
>
>
> On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> That's returned by the external API we're querying. We query them for
> active records; if a previous active record isn't included in the results,
> that means it's time to archive that record.
>
>
>
> On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <khangaon...@gmail.com>
> wrote:
>
> Hi,
>
> How do you determine if the record is no longer active? Is it a periodic
> process that goes through every record and checks when the last update
> happened?
>
> regards
>
>
>
> On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> Hey all,
>
>
>
> We are working on moving a MySQL-based application to Cassandra.
>
>
>
> The workflow in MySQL is this: we have two tables, active and archive.
> Every hour, we pull in data from an external API. The records which are
> active are kept in the 'active' table. Once a record is no longer active,
> it's deleted from 'active' and re-inserted into 'archive'.
>
>
>
> The reason is that most of the time, queries are only run
> against the active records rather than the archived ones. Keeping the
> active table small may therefore help with faster queries, if it only has
> to search 200k records vs. 3 million or more.
>
>
>
> Is it advisable to keep the same data model in Cassandra? I'm concerned
> about tombstone issues when records are deleted from active.
>
>
>
> Thanks.
>
>
>
>   --
>
> http://khangaonkar.blogspot.com/
>
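
For reference, the alias scheme Oded describes in the quoted thread (insert into "archive" and the unaliased "activeN" table, flip "active_alias", then truncate the table that just went out of use) can be sketched roughly as below. This is only an in-memory illustration in Python: plain dicts stand in for the Cassandra tables, and the class and method names (ActiveAliasStore, refresh, get_active) are made up for this sketch and are not part of any driver or of the original proposal.

```python
class ActiveAliasStore:
    """In-memory stand-in for the archive / active1 / active2 / active_alias tables."""

    def __init__(self):
        self.archive = {}                            # all records ever seen
        self.tables = {"active1": {}, "active2": {}}
        self.alias = "active1"                       # plays the role of the row in active_alias

    def _unaliased(self):
        # The table that readers are NOT currently pointed at.
        return "active2" if self.alias == "active1" else "active1"

    def refresh(self, api_records):
        """Apply one hourly API result (dict of record id -> payload)."""
        target = self._unaliased()
        # Upsert into archive: no read-before-write and no deletes.
        self.archive.update(api_records)
        # Fill the table that is currently out of use.
        self.tables[target].update(api_records)
        # Flip the alias so readers switch to the freshly loaded table.
        previous, self.alias = self.alias, target
        # Empty the table that just went out of use (TRUNCATE in CQL).
        self.tables[previous].clear()

    def get_active(self, record_id):
        # Readers resolve the alias first, then read from that table.
        return self.tables[self.alias].get(record_id)


store = ActiveAliasStore()
store.refresh({"a": 1, "b": 2})   # first hourly pull
store.refresh({"b": 3, "c": 4})   # "a" is no longer returned by the API
```

After the second refresh, "a" is absent from the aliased table but still present in the archive, and no per-record delete was ever issued against the active tables. In a real cluster the alias flip and the truncate are separate operations, so readers that resolved the alias just before the flip would need to be considered.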
