Re: Data model suggestions

Laing, Michael Mon, 27 Apr 2015 05:11:58 -0700

No - it immediately removes the sstables on all nodes.

On Mon, Apr 27, 2015 at 7:53 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:


> Wouldn't truncating the table create tombstones?
>
> On Mon, Apr 27, 2015 at 11:55 AM, Peer, Oded <oded.p...@rsa.com> wrote:
>
>>  I recommend truncating the table instead of dropping it since you don’t
>> need to re-issue DDL commands and put load on the system keyspace.
>>
>> Both DROP and TRUNCATE automatically create snapshots, there no
>> “snapshotting” advantage for using DROP . See
>> http://docs.datastax.com/en/cassandra/2.1/cassandra/configuration/configCassandra_yaml_r.html?scroll=reference_ds_qfg_n1r_1k__auto_snapshot
>>
>>
>>
>>
>>
>> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
>> *Sent:* Sunday, April 26, 2015 10:31 PM
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Data model suggestions
>>
>>
>>
>> Thanks Peer. I like the approach you're suggesting.
>>
>>
>>
>> Why do you recommend truncating the last active table rather than just
>> dropping it? Since all the data would be inserted into a new table, seems
>> like it would make sense to drop the last table, and that way truncate
>> snapshotting also won't have to be dealt with (unless I'm missing anything).
>>
>>
>>
>> Thanks.
>>
>>
>>
>>
>>
>> On Sun, Apr 26, 2015 at 1:29 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>
>> I would maintain two tables.
>>
>> An “archive” table that holds all the active and inactive records, and is
>> updated hourly (re-inserting the same record has some compaction overhead
>> but on the other side deleting records has tombstones overhead).
>>
>> An “active” table which holds all the records in the last external API
>> invocation.
>>
>> To avoid tombstones and read-before-delete issues “active” should
>> actually a synonym, an alias, to the most recent active table.
>>
>> I suggest you create two identical tables, “active1” and “active2”, and
>> an “active_alias” table that informs which of the two is the most recent.
>>
>> Thus when you query the external API you insert the data to “archive” and
>> to the unaliased “activeN” table, switch the alias value in “active_alias”
>> and truncate the new unaliased “activeM” table.
>>
>> No need to query the data before inserting it. Make sure truncating
>> doesn’t create automatic snapshots.
>>
>>
>>
>>
>>
>> *From:* Narendra Sharma [mailto:narendra.sha...@gmail.com]
>> *Sent:* Friday, April 24, 2015 6:53 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Data model suggestions
>>
>>
>>
>> I think one table say record should be good. The primary key is record
>> id. This will ensure good distribution.
>> Just update the active attribute to true or false.
>> For range query on active vs archive records maintain 2 indexes or try
>> secondary index.
>>
>> On Apr 23, 2015 1:32 PM, "Ali Akhtar" <ali.rac...@gmail.com> wrote:
>>
>> Good point about the range selects. I think they can be made to work with
>> limits, though. Or, since the active records will never usually be > 500k,
>> the ids may just be cached in memory.
>>
>>
>>
>> Most of the time, during reads, the queries will just consist of select *
>> where primaryKey = someValue . One row at a time.
>>
>>
>>
>> The question is just, whether to keep all records in one table (including
>> archived records which wont be queried 99% of the time), or to keep active
>> records in their own table, and delete them when they're no longer active.
>> Will that produce tombstone issues?
>>
>>
>>
>> On Fri, Apr 24, 2015 at 12:56 AM, Manoj Khangaonkar <
>> khangaon...@gmail.com> wrote:
>>
>> Hi,
>>
>> If your external API returns active records, that means I am guessing you
>> need to do a select * on the active table to figure out which records in
>> the table are no longer active.
>>
>> You might be aware that range selects based on partition key will timeout
>> in cassandra. They can however be made to work using the column cluster
>> key.
>>
>> To comment more, We would need to see your proposed cassandra tables and
>> queries that you might need to run.
>>
>> regards
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Apr 23, 2015 at 9:45 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>> That's returned by the external API we're querying. We query them for
>> active records, if a previous active record isn't included in the results,
>> that means its time to archive that record.
>>
>>
>>
>> On Thu, Apr 23, 2015 at 9:20 PM, Manoj Khangaonkar <khangaon...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> How do you determine if the record is no longer active ? Is it a
>> perioidic process that goes through every record and checks when the last
>> update happened ?
>>
>> regards
>>
>>
>>
>> On Thu, Apr 23, 2015 at 8:09 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>> Hey all,
>>
>>
>>
>> We are working on moving a mysql based application to Cassandra.
>>
>>
>>
>> The workflow in mysql is this: We have two tables: active and archive .
>> Every hour, we pull in data from an external API. The records which are
>> active, are kept in 'active' table. Once a record is no longer active, its
>> deleted from 'active' and re-inserted into 'archive'
>>
>>
>>
>> The purpose for that, is because most of the time, queries are only done
>> against the active records rather than archived. Therefore keeping the
>> active table small may help with faster queries, if it only has to search
>> 200k records vs 3 million or more.
>>
>>
>>
>> Is it advisable to keep the same data model in Cassandra? I'm concerned
>> about tombstone issues when records are deleted from active.
>>
>>
>>
>> Thanks.
>>
>>
>>
>>   --
>>
>> http://khangaonkar.blogspot.com/
>>
>>
>>
>>
>>
>>   --
>>
>> http://khangaonkar.blogspot.com/
>>
>>
>>
>>
>>
>
>

Re: Data model suggestions

Reply via email to