But your previous email talked about when T1 is different: > Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then you store V’ with timestamp T1.
What if you issue an update twice, but with the same timestamp? E.g if you ran: Update .... where foo=bar USING TIMESTAMP = 10000000 and 1 hour later, you ran exactly the same query again. In this case, the value of T is the same for both queries. Would that still cause multiple values to be stored? On Wed, May 13, 2015 at 5:17 PM, Peer, Oded <oded.p...@rsa.com> wrote: > It will cause an overhead (compaction and read) as I described in the > previous email. > > > > *From:* Ali Akhtar [mailto:ali.rac...@gmail.com] > *Sent:* Wednesday, May 13, 2015 3:13 PM > > *To:* user@cassandra.apache.org > *Subject:* Re: Updating only modified records (where lastModified < > current date) > > > > > I don’t understand the ETL use case and its relevance here. Can you > provide more details? > > > > Basically, every 1 hour a job runs which queries an external API and gets > some records. Then, I want to take only new or updated records, and insert > / update them in cassandra. For records that are already in cassandra and > aren't modified, I want to ignore them. > > > > Each record returns a lastModified datetime, I want to use that to > determine whether a record was changed or not (if it was, it'd be updated, > if not, it'd be ignored). > > > > The issue was, I'm having to do a 'select lastModified from table where id > = ?' query for every record, in order to determine if db lastModified < api > lastModified or not. I was wondering if there was a way to avoid that. > > > > If I use 'USING TIMESTAMP', would subsequent updates where lastModified is > a value that was previously used, still create that overhead, or will they > be ignored? > > > > E.g if I issued an update where TIMESTAMP is X, then 1 hour later I issued > another update where TIMESTAMP is still X, will that 2nd update essentially > get ignored, or will it cause any overhead? > > > > On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <oded.p...@rsa.com> wrote: > > USING TIMESTAMP doesn’t avoid compaction overhead. > > When you modify data the value is stored along with a timestamp indicating > the timestamp of the value. > > Assume timestamp T1 < T2 and you stored value V with timestamp T2. Then > you store V’ with timestamp T1. > > Now you have two values of V in the DB: <V,T2>, <V’,T1> > > When you read the value of V from the DB you read both <V,T2>, <V’,T1>, > Cassandra resolves the conflict by comparing the timestamp and returns V. > > Compaction will later take care and remove <V’,T1> from the DB. > > > > I don’t understand the ETL use case and its relevance here. Can you > provide more details? > > > > UPDATE in Cassandra updates specific rows. All of them are updated, > nothing is ignored. > > > > > > *From:* Ali Akhtar [mailto:ali.rac...@gmail.com] > *Sent:* Wednesday, May 13, 2015 2:43 PM > > > *To:* user@cassandra.apache.org > *Subject:* Re: Updating only modified records (where lastModified < > current date) > > > > Its rare for an existing record to have changes, but the etl job runs > every hour, therefore it will send updates each time, regardless of whether > there were changes or not. > > > > (I'm assuming that USING TIMESTAMP here will avoid the compaction > overhead, since that will cause it to not run any updates unless the > timestamp is actually > last update timestamp?) > > > > Also, is there a way to get the number of rows which were updated / > ignored? > > > > On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <oded.p...@rsa.com> wrote: > > The cost of issuing an UPDATE that won’t update anything is compaction > overhead. Since you stated it’s rare for rows to be updated then the > overhead should be negligible. > > > > The easiest way to convert a milliseconds timestamp long value to > microseconds is to multiply by 1000. > > > > *From:* Ali Akhtar [mailto:ali.rac...@gmail.com] > *Sent:* Wednesday, May 13, 2015 2:15 PM > *To:* user@cassandra.apache.org > *Subject:* Re: Updating only modified records (where lastModified < > current date) > > > > Would TimeUnit.MILLISECONDS.toMicros( myDate.getTime() ) work for > producing the microsecond timestamp ? > > > > On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac...@gmail.com> wrote: > > If specifying 'using' timestamp, the docs say to provide microseconds, but > where are these microseconds obtained from? I have regular java.util.Date > objects, I can get the time in milliseconds (i.e the unix timestamp), how > would I convert that to microseconds? > > > > On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac...@gmail.com> wrote: > > Thanks Peter, that's interesting. I didn't know of that option. > > > > If updates don't create tombstones (and i'm already taking pains to ensure > no nulls are present in queries), then is there no cost to just submitting > an update for everything regardless of whether lastModified has changed? > > > > Thanks. > > > > On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <oded.p...@rsa.com> wrote: > > You can use the “last modified” value as the TIMESTAMP for your UPDATE > operation. > > This way the values will only be updated if lastModified date > the > lastModified you have in the DB. > > > > Updates to values don’t create tombstones. Only deletes (either by > executing delete, inserting a null value or by setting a TTL) create > tombstones. > > > > > > *From:* Ali Akhtar [mailto:ali.rac...@gmail.com] > *Sent:* Wednesday, May 13, 2015 1:27 PM > *To:* user@cassandra.apache.org > *Subject:* Updating only modified records (where lastModified < current > date) > > > > I'm running some ETL jobs, where the pattern is the following: > > > > 1- Get some records from an external API, > > > > 2- For each record, see if its lastModified date > the lastModified i have > in db (or if I don't have that record in db) > > > > 3- If lastModified < dbLastModified, the item wasn't changed, ignore it. > Otherwise, run an update query and update that record. > > > > (It is rare for existing records to get updated, so I'm not that concerned > about tombstones). > > > > The problem however is, since I have to query each record's lastModified, > one at a time, that's adding a major bottleneck to my job. > > > > E.g if I have 6k records, I have to run a total of 6k 'select lastModified > from myTable where id = ?' queries. > > > > Is there a better way, am I doing anything wrong, etc? Any suggestions > would be appreciated. > > > > Thanks. > > > > > > > > > > >