But your previous email talked about when T1 is different:

> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
you store V’ with timestamp T1.

What if you issue an update twice, but with the same timestamp? E.g., if you
ran:

UPDATE .... USING TIMESTAMP 10000000 SET .... WHERE foo=bar

and 1 hour later, you ran exactly the same query again. In this case, the
value of T is the same for both queries. Would that still cause multiple
values to be stored?

On Wed, May 13, 2015 at 5:17 PM, Peer, Oded <oded.p...@rsa.com> wrote:

>  It will cause an overhead (compaction and read) as I described in the
> previous email.
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 3:13 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> > I don’t understand the ETL use case and its relevance here. Can you
> provide more details?
>
>
>
> Basically, every 1 hour a job runs which queries an external API and gets
> some records. Then, I want to take only new or updated records, and insert
> / update them in cassandra. For records that are already in cassandra and
> aren't modified, I want to ignore them.
>
>
>
> Each record returns a lastModified datetime, I want to use that to
> determine whether a record was changed or not (if it was, it'd be updated,
> if not, it'd be ignored).
>
>
>
> The issue was, I'm having to do a 'select lastModified from table where id
> = ?' query for every record, in order to determine if db lastModified < api
> lastModified or not. I was wondering if there was a way to avoid that.
>
>
>
> If I use 'USING TIMESTAMP', would subsequent updates where lastModified is
> a value that was previously used, still create that overhead, or will they
> be ignored?
>
>
>
> E.g., if I issued an update where TIMESTAMP is X, then 1 hour later I issued
> another update where TIMESTAMP is still X, will that 2nd update essentially
> get ignored, or will it cause any overhead?
>
>
>
> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
> USING TIMESTAMP doesn’t avoid compaction overhead.
>
> When you modify data, the value is stored along with its write timestamp.
>
> Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then
> you store V’ with timestamp T1.
>
> Now you have two values of V in the DB: <V,T2>, <V’,T1>
>
> When you read the value of V from the DB you read both <V,T2> and <V’,T1>.
> Cassandra resolves the conflict by comparing the timestamps and returns V.
>
> Compaction will later remove <V’,T1> from the DB.
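
The resolution described above can be modeled with a small toy class. This is only a sketch of the idea (every write is kept, reads pick the highest timestamp, compaction drops the losers), not Cassandra's actual storage engine. The class and method names are made up for illustration, and tie-breaking between equal timestamps is deliberately not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of timestamp-based conflict resolution; all names are hypothetical.
class LastWriteWinsModel {
    /** One stored version of a value together with its write timestamp. */
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final List<Cell> versions = new ArrayList<>();

    /** Every write is appended; nothing is compared or rejected at write time. */
    void write(String value, long timestamp) {
        versions.add(new Cell(value, timestamp));
    }

    /** Scans all stored versions and picks the one with the highest timestamp. */
    private Cell winner() {
        Cell best = null;
        for (Cell c : versions) {
            if (best == null || c.timestamp > best.timestamp) {
                best = c;
            }
        }
        return best;
    }

    /** A read has to look at every stored version to find the winner. */
    String read() {
        Cell w = winner();
        return w == null ? null : w.value;
    }

    /** Compaction keeps only the winning version and discards the rest. */
    void compact() {
        Cell w = winner();
        versions.clear();
        if (w != null) {
            versions.add(w);
        }
    }

    int storedVersions() {
        return versions.size();
    }

    public static void main(String[] args) {
        LastWriteWinsModel db = new LastWriteWinsModel();
        db.write("V", 2L);   // stored with timestamp T2
        db.write("V'", 1L);  // stored later, but with the older timestamp T1
        System.out.println(db.read() + " from " + db.storedVersions() + " stored versions");
        db.compact();
        System.out.println(db.read() + " from " + db.storedVersions() + " stored versions");
    }
}
```

In this model the late write with the older timestamp never becomes visible, but it still occupies space and is still scanned on reads until compact() runs, which is the overhead discussed in this thread.
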
>
>
>
> I don’t understand the ETL use case and its relevance here. Can you
> provide more details?
>
>
>
> UPDATE in Cassandra updates specific rows. All of them are updated,
> nothing is ignored.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:43 PM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> It's rare for an existing record to have changes, but the ETL job runs
> every hour, therefore it will send updates each time, regardless of whether
> there were changes or not.
>
>
>
> (I'm assuming that USING TIMESTAMP here will avoid the compaction
> overhead, since that will cause it to not run any updates unless the
> timestamp is actually > last update timestamp?)
>
>
>
> Also, is there a way to get the number of rows which were updated /
> ignored?
>
>
>
> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
> The cost of issuing an UPDATE that won’t update anything is compaction
> overhead. Since you stated it’s rare for rows to be updated then the
> overhead should be negligible.
>
>
>
> The easiest way to convert a milliseconds timestamp long value to
> microseconds is to multiply by 1000.
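
As a minimal sketch of that conversion in Java, assuming the source value is a java.util.Date: multiplying by 1000 and using TimeUnit give the same result. The class name, method name, and example instant are arbitrary choices for illustration.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; only java.util.Date and TimeUnit come from the JDK.
class MicrosTimestamp {
    /** Converts a java.util.Date to microseconds since the Unix epoch. */
    static long toMicros(Date date) {
        // Date.getTime() returns milliseconds since the epoch.
        return TimeUnit.MILLISECONDS.toMicros(date.getTime());
    }

    public static void main(String[] args) {
        Date lastModified = new Date(1431522000000L); // arbitrary example instant
        long viaTimeUnit = toMicros(lastModified);
        long viaMultiply = lastModified.getTime() * 1000L;
        System.out.println(viaTimeUnit + " equal=" + (viaTimeUnit == viaMultiply));
    }
}
```
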
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 2:15 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Updating only modified records (where lastModified <
> current date)
>
>
>
> Would TimeUnit.MILLISECONDS.toMicros(  myDate.getTime() ) work for
> producing the microsecond timestamp ?
>
>
>
> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> If specifying USING TIMESTAMP, the docs say to provide microseconds, but
> where are these microseconds obtained from? I have regular java.util.Date
> objects and can get the time in milliseconds (i.e., the Unix timestamp). How
> would I convert that to microseconds?
>
>
>
> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>
> Thanks Peter, that's interesting. I didn't know of that option.
>
>
>
> If updates don't create tombstones (and I'm already taking pains to ensure
> no nulls are present in queries), then is there no cost to just submitting
> an update for everything regardless of whether lastModified has changed?
>
>
>
> Thanks.
>
>
>
> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>
> You can use the “last modified” value as the TIMESTAMP for your UPDATE
> operation.
>
> This way the values will only be updated if lastModified date > the
> lastModified you have in the DB.
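
A rough sketch of what such a statement could look like when built from Java follows. It only assembles the CQL text and does not use any driver. The column names payload, lastModified, and id are assumptions rather than a real schema, and myTable is borrowed from the original question.

```java
import java.util.Date;
import java.util.concurrent.TimeUnit;

// Hypothetical helper that only builds the CQL string; it executes nothing.
class TimestampedUpdate {
    /**
     * Builds an UPDATE whose write timestamp is the record's lastModified
     * time in microseconds. Column names here are placeholders.
     */
    static String buildUpdate(String table, Date lastModified) {
        long micros = TimeUnit.MILLISECONDS.toMicros(lastModified.getTime());
        // In CQL, USING TIMESTAMP goes between the table name and SET.
        return "UPDATE " + table
                + " USING TIMESTAMP " + micros
                + " SET payload = ?, lastModified = ?"
                + " WHERE id = ?";
    }

    public static void main(String[] args) {
        // The resulting string would be prepared once and bound per record.
        System.out.println(buildUpdate("myTable", new Date(10_000L)));
    }
}
```

With this shape, the job can send every record without first selecting its stored lastModified: a write carrying an older timestamp is still stored, but it loses to the newer value on read.
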
>
>
>
> Updates to values don’t create tombstones. Only deletes (either by
> executing delete, inserting a null value or by setting a TTL) create
> tombstones.
>
>
>
>
>
> *From:* Ali Akhtar [mailto:ali.rac...@gmail.com]
> *Sent:* Wednesday, May 13, 2015 1:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Updating only modified records (where lastModified < current
> date)
>
>
>
> I'm running some ETL jobs, where the pattern is the following:
>
>
>
> 1- Get some records from an external API,
>
>
>
> 2- For each record, see if its lastModified date > the lastModified I have
> in the DB (or if I don't have that record in the DB)
>
>
>
> 3- If lastModified <= dbLastModified, the item wasn't changed; ignore it.
> Otherwise, run an update query and update that record.
>
>
>
> (It is rare for existing records to get updated, so I'm not that concerned
> about tombstones).
>
>
>
> The problem however is, since I have to query each record's lastModified,
> one at a time, that's adding a major bottleneck to my job.
>
>
>
> E.g., if I have 6k records, I have to run a total of 6k 'select lastModified
> from myTable where id = ?' queries.
>
>
>
> Is there a better way, am I doing anything wrong, etc? Any suggestions
> would be appreciated.
>
>
>
> Thanks.