Can lightweight txns be used in a batch update?
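E.g. something like this (rough, untested sketch with the Java driver;
table and column names are made up, and I'm not sure whether conditional
statements are even legal inside a batch, or whether they would all have
to target the same partition):

import java.util.Date;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BatchLwtSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        // Conditional (lightweight transaction) update: only applied if
        // the stored copy is older than what the API returned.
        PreparedStatement ps = session.prepare(
            "UPDATE mytable SET payload = ?, last_modified = ? "
          + "WHERE id = ? IF last_modified < ?");

        Date lastModified = new Date();
        BatchStatement batch = new BatchStatement();
        batch.add(ps.bind("payload-1", lastModified, "id-1", lastModified));
        // ... presumably one bound statement per record would go here,
        // if batching conditional updates is allowed at all.
        session.execute(batch);

        cluster.close();
    }
}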

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> The 6k is only the starting value; it's expected to scale up to ~200
> million records.
>
> On Wed, May 13, 2015 at 5:44 PM, Robert Wille <rwi...@fold3.com> wrote:
>
>> You could use lightweight transactions to update only if the record is
>> newer. It doesn't avoid the read, it just happens under the covers, so
>> it's not really going to be faster than a read-before-write pattern
>> (which is an anti-pattern, BTW). It is probably the easiest way to
>> avoid getting a whole bunch of copies of each record.
>>
>> But even with a read-before-write pattern, I don't understand why you
>> are worried about 6K records per hour. That's nothing. You're probably
>> looking at several milliseconds to do the read and write for each
>> record (depending on your storage, RF and CL), so you're probably
>> looking at under a minute to do 6K records serially. If you do them in
>> parallel, you're probably looking at several seconds. I don't get why
>> something that takes less than a minute, once an hour, is a problem.
>>
>> BTW, I wouldn't do all 6K in parallel. I'd use some kind of limiter
>> (e.g. a semaphore) to ensure that you don't execute more than X queries
>> at a time.
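>>
>> Roughly like this (untested sketch; assumes the DataStax Java driver
>> and Guava, and I'm inventing the table and column names):
>>
>> import java.util.Date;
>> import java.util.concurrent.Semaphore;
>> import com.datastax.driver.core.Cluster;
>> import com.datastax.driver.core.PreparedStatement;
>> import com.datastax.driver.core.ResultSet;
>> import com.datastax.driver.core.ResultSetFuture;
>> import com.datastax.driver.core.Session;
>> import com.google.common.util.concurrent.FutureCallback;
>> import com.google.common.util.concurrent.Futures;
>>
>> public class ThrottledLwtUpdates {
>>     public static void main(String[] args) throws InterruptedException {
>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>         Session session = cluster.connect("mykeyspace");
>>
>>         // LWT: the write only applies if the stored copy is older.
>>         PreparedStatement ps = session.prepare(
>>             "UPDATE mytable SET payload = ?, last_modified = ? "
>>           + "WHERE id = ? IF last_modified < ?");
>>
>>         // Never more than 64 queries in flight at a time.
>>         final Semaphore limiter = new Semaphore(64);
>>
>>         for (int i = 0; i < 6000; i++) {     // stand-in for the API records
>>             limiter.acquire();               // blocks while 64 are outstanding
>>             Date lm = new Date();
>>             ResultSetFuture f = session.executeAsync(
>>                 ps.bind("payload-" + i, lm, "id-" + i, lm));
>>             Futures.addCallback(f, new FutureCallback<ResultSet>() {
>>                 public void onSuccess(ResultSet rs) { limiter.release(); }
>>                 public void onFailure(Throwable t)  { limiter.release(); }
>>             });
>>         }
>>
>>         limiter.acquire(64);                 // drain: wait for the stragglers
>>         cluster.close();
>>     }
>> }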
>>
>> Robert
>>
>> On May 13, 2015, at 6:20 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>> But your previous email talked about when T1 is different:
>>
>> > Assume timestamp T1 < T2 and you stored value V with timestamp T2.
>> > Then you store V' with timestamp T1.
>>
>> What if you issue an update twice, but with the same timestamp? E.g. if
>> you ran:
>>
>> UPDATE mytable USING TIMESTAMP 10000000 SET ... WHERE foo = bar
>>
>> and 1 hour later you ran exactly the same query again. In this case,
>> the value of T is the same for both queries. Would that still cause
>> multiple values to be stored?
>>
>> On Wed, May 13, 2015 at 5:17 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>
>>> It will cause an overhead (compaction and read) as I described in the
>>> previous email.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 3:13 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> > I don't understand the ETL use case and its relevance here. Can you
>>> > provide more details?
>>>
>>> Basically, every hour a job runs which queries an external API and
>>> gets some records. Then I want to take only the new or updated records
>>> and insert/update them in Cassandra. Records that are already in
>>> Cassandra and weren't modified should be ignored.
>>>
>>> Each record returns a lastModified datetime; I want to use that to
>>> determine whether a record was changed or not (if it was, it'd be
>>> updated; if not, it'd be ignored).
>>>
>>> The issue was, I'm having to do a 'select lastModified from table
>>> where id = ?' query for every record, in order to determine whether db
>>> lastModified < api lastModified. I was wondering if there was a way to
>>> avoid that.
>>>
>>> If I use 'USING TIMESTAMP', would subsequent updates where
>>> lastModified is a value that was previously used still create that
>>> overhead, or would they be ignored?
>>>
>>> E.g. if I issued an update where TIMESTAMP is X, then 1 hour later I
>>> issued another update where TIMESTAMP is still X, would that 2nd
>>> update essentially get ignored, or would it cause any overhead?
>>>
>>> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> USING TIMESTAMP doesn't avoid compaction overhead.
>>>
>>> When you modify data, the value is stored along with a timestamp
>>> indicating the timestamp of the value.
>>>
>>> Assume timestamp T1 < T2 and you stored value V with timestamp T2.
>>> Then you store V' with timestamp T1.
>>>
>>> Now you have two values of V in the DB: <V,T2>, <V',T1>.
>>>
>>> When you read the value of V from the DB you read both <V,T2> and
>>> <V',T1>; Cassandra resolves the conflict by comparing the timestamps
>>> and returns V.
>>>
>>> Compaction will later take care of removing <V',T1> from the DB.
>>>
>>> I don't understand the ETL use case and its relevance here. Can you
>>> provide more details?
>>>
>>> UPDATE in Cassandra updates specific rows. All of them are updated;
>>> nothing is ignored.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 2:43 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> It's rare for an existing record to have changes, but the ETL job runs
>>> every hour, so it will send updates each time, regardless of whether
>>> there were changes or not.
>>>
>>> (I'm assuming that USING TIMESTAMP here will avoid the compaction
>>> overhead, since it will cause no updates to apply unless the timestamp
>>> is actually > the last update timestamp?)
>>>
>>> Also, is there a way to get the number of rows which were updated /
>>> ignored?
>>>
>>> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> The cost of issuing an UPDATE that won't update anything is compaction
>>> overhead. Since you stated it's rare for rows to be updated, the
>>> overhead should be negligible.
>>>
>>> The easiest way to convert a milliseconds timestamp long value to
>>> microseconds is to multiply by 1000.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 2:15 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> Would TimeUnit.MILLISECONDS.toMicros(myDate.getTime()) work for
>>> producing the microsecond timestamp?
>>>
>>> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>> If specifying 'using' timestamp, the docs say to provide microseconds,
>>> but where are these microseconds obtained from? I have regular
>>> java.util.Date objects; I can get the time in milliseconds (i.e. the
>>> unix timestamp). How would I convert that to microseconds?
>>>
>>> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>> Thanks Oded, that's interesting. I didn't know of that option.
>>>
>>> If updates don't create tombstones (and I'm already taking pains to
>>> ensure no nulls are present in queries), then is there no cost to just
>>> submitting an update for everything, regardless of whether
>>> lastModified has changed?
>>>
>>> Thanks.
>>>
>>> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> You can use the "last modified" value as the TIMESTAMP for your UPDATE
>>> operation.
>>>
>>> This way the values will only be updated if lastModified date > the
>>> lastModified you have in the DB.
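>>>
>>> For example, schematically (untested; substitute your own table and
>>> column names, and note that write timestamps are in microseconds while
>>> java.util.Date gives milliseconds):
>>>
>>> import java.util.Date;
>>> import com.datastax.driver.core.Cluster;
>>> import com.datastax.driver.core.PreparedStatement;
>>> import com.datastax.driver.core.Session;
>>>
>>> public class TimestampedUpsert {
>>>     public static void main(String[] args) {
>>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>>         Session session = cluster.connect("mykeyspace");
>>>
>>>         Date lastModified = new Date();              // from the API record
>>>         long micros = lastModified.getTime() * 1000; // millis -> micros
>>>
>>>         // No read-before-write: if the DB already holds a copy written
>>>         // with a newer timestamp, this write loses the timestamp
>>>         // comparison on read, and compaction later discards it.
>>>         // (If your driver version doesn't accept a bind marker after
>>>         // USING TIMESTAMP, inline the value into the statement instead.)
>>>         PreparedStatement ps = session.prepare(
>>>             "UPDATE mytable USING TIMESTAMP ? "
>>>           + "SET payload = ?, last_modified = ? WHERE id = ?");
>>>         session.execute(ps.bind(micros, "payload", lastModified, "id-1"));
>>>
>>>         cluster.close();
>>>     }
>>> }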
>>>
>>> Updates to values don't create tombstones. Only deletes (either by
>>> executing a delete, inserting a null value or setting a TTL) create
>>> tombstones.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 1:27 PM
>>> To: user@cassandra.apache.org
>>> Subject: Updating only modified records (where lastModified < current
>>> date)
>>>
>>> I'm running some ETL jobs, where the pattern is the following:
>>>
>>> 1- Get some records from an external API.
>>>
>>> 2- For each record, see if its lastModified date > the lastModified I
>>> have in the db (or if I don't have that record in the db at all).
>>>
>>> 3- If lastModified is not newer than dbLastModified, the item wasn't
>>> changed; ignore it. Otherwise, run an update query and update that
>>> record.
>>>
>>> (It is rare for existing records to get updated, so I'm not that
>>> concerned about tombstones.)
>>>
>>> The problem, however, is that since I have to query each record's
>>> lastModified one at a time, that adds a major bottleneck to my job.
>>>
>>> E.g. if I have 6k records, I have to run a total of 6k 'select
>>> lastModified from myTable where id = ?' queries.
>>>
>>> Is there a better way? Am I doing anything wrong? Any suggestions
>>> would be appreciated.
>>>
>>> Thanks.
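>>>
>>> P.S. In code, the current flow is roughly this (simplified sketch,
>>> made-up names, 2.x Java driver):
>>>
>>> import java.util.Collections;
>>> import java.util.Date;
>>> import java.util.List;
>>> import com.datastax.driver.core.Cluster;
>>> import com.datastax.driver.core.PreparedStatement;
>>> import com.datastax.driver.core.Row;
>>> import com.datastax.driver.core.Session;
>>>
>>> public class CurrentEtlFlow {
>>>     static class ApiRecord {
>>>         String id;
>>>         String payload;
>>>         Date lastModified;
>>>     }
>>>
>>>     static List<ApiRecord> fetchFromApi() {
>>>         return Collections.emptyList();  // stub for the external API call
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>>         Session session = cluster.connect("mykeyspace");
>>>
>>>         PreparedStatement read = session.prepare(
>>>             "SELECT lastModified FROM myTable WHERE id = ?");
>>>         PreparedStatement write = session.prepare(
>>>             "UPDATE myTable SET payload = ?, lastModified = ? WHERE id = ?");
>>>
>>>         // One synchronous read per record -- this is the bottleneck.
>>>         for (ApiRecord r : fetchFromApi()) {
>>>             Row row = session.execute(read.bind(r.id)).one();
>>>             Date dbLastModified = (row == null) ? null : row.getDate("lastModified");
>>>             if (dbLastModified == null || r.lastModified.after(dbLastModified)) {
>>>                 session.execute(write.bind(r.payload, r.lastModified, r.id));
>>>             }
>>>         }
>>>         cluster.close();
>>>     }
>>> }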