Can lightweight txns be used in a batch update?
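E.g. something like this (rough, untested sketch with the Java driver;
table and column names are made up, and I'm not sure whether conditional
statements are even legal inside a batch, or whether they would all have
to target the same partition):

import java.util.Date;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BatchLwtSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("mykeyspace");

        // Conditional (lightweight transaction) update: only applied if
        // the stored copy is older than what the API returned.
        PreparedStatement ps = session.prepare(
            "UPDATE mytable SET payload = ?, last_modified = ? "
          + "WHERE id = ? IF last_modified < ?");

        Date lastModified = new Date();
        BatchStatement batch = new BatchStatement();
        batch.add(ps.bind("payload-1", lastModified, "id-1", lastModified));
        // ... presumably one bound statement per record would go here,
        // if batching conditional updates is allowed at all.
        session.execute(batch);

        cluster.close();
    }
}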

On Wed, May 13, 2015 at 5:48 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:

> The 6k is only the starting value; it's expected to scale up to ~200
> million records.
>
> On Wed, May 13, 2015 at 5:44 PM, Robert Wille <rwi...@fold3.com> wrote:
>
>> You could use lightweight transactions to update only if the record is
>> newer. It doesn't avoid the read, it just happens under the covers, so
>> it's not really going to be faster than a read-before-write pattern
>> (which is an anti-pattern, BTW). It is probably the easiest way to
>> avoid getting a whole bunch of copies of each record.
>>
>> But even with a read-before-write pattern, I don't understand why you
>> are worried about 6K records per hour. That's nothing. You're probably
>> looking at several milliseconds to do the read and write for each
>> record (depending on your storage, RF and CL), so you're probably
>> looking at under a minute to do 6K records serially. If you do them in
>> parallel, you're probably looking at several seconds. I don't get why
>> something that takes less than a minute, once an hour, is a problem.
>>
>> BTW, I wouldn't do all 6K in parallel. I'd use some kind of limiter
>> (e.g. a semaphore) to ensure that you don't execute more than X queries
>> at a time.
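>>
>> Roughly like this (untested sketch; assumes the DataStax Java driver
>> and Guava, and I'm inventing the table and column names):
>>
>> import java.util.Date;
>> import java.util.concurrent.Semaphore;
>> import com.datastax.driver.core.Cluster;
>> import com.datastax.driver.core.PreparedStatement;
>> import com.datastax.driver.core.ResultSet;
>> import com.datastax.driver.core.ResultSetFuture;
>> import com.datastax.driver.core.Session;
>> import com.google.common.util.concurrent.FutureCallback;
>> import com.google.common.util.concurrent.Futures;
>>
>> public class ThrottledLwtUpdates {
>>     public static void main(String[] args) throws InterruptedException {
>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>         Session session = cluster.connect("mykeyspace");
>>
>>         // LWT: the write only applies if the stored copy is older.
>>         PreparedStatement ps = session.prepare(
>>             "UPDATE mytable SET payload = ?, last_modified = ? "
>>           + "WHERE id = ? IF last_modified < ?");
>>
>>         // Never more than 64 queries in flight at a time.
>>         final Semaphore limiter = new Semaphore(64);
>>
>>         for (int i = 0; i < 6000; i++) {     // stand-in for the API records
>>             limiter.acquire();               // blocks while 64 are outstanding
>>             Date lm = new Date();
>>             ResultSetFuture f = session.executeAsync(
>>                 ps.bind("payload-" + i, lm, "id-" + i, lm));
>>             Futures.addCallback(f, new FutureCallback<ResultSet>() {
>>                 public void onSuccess(ResultSet rs) { limiter.release(); }
>>                 public void onFailure(Throwable t)  { limiter.release(); }
>>             });
>>         }
>>
>>         limiter.acquire(64);                 // drain: wait for the stragglers
>>         cluster.close();
>>     }
>> }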
>>
>> Robert
>>
>> On May 13, 2015, at 6:20 AM, Ali Akhtar <ali.rac...@gmail.com> wrote:
>>
>> But your previous email talked about when T1 is different:
>>
>> > Assume timestamp T1 < T2 and you stored value V with timestamp T2.
>> > Then you store V' with timestamp T1.
>>
>> What if you issue an update twice, but with the same timestamp? E.g. if
>> you ran:
>>
>> UPDATE mytable USING TIMESTAMP 10000000 SET ... WHERE foo = bar
>>
>> and 1 hour later you ran exactly the same query again. In this case,
>> the value of T is the same for both queries. Would that still cause
>> multiple values to be stored?
>>
>> On Wed, May 13, 2015 at 5:17 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>
>>> It will cause an overhead (compaction and read) as I described in the
>>> previous email.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 3:13 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> > I don't understand the ETL use case and its relevance here. Can you
>>> > provide more details?
>>>
>>> Basically, every hour a job runs which queries an external API and
>>> gets some records. Then I want to take only the new or updated records
>>> and insert/update them in Cassandra. Records that are already in
>>> Cassandra and weren't modified should be ignored.
>>>
>>> Each record returns a lastModified datetime; I want to use that to
>>> determine whether a record was changed or not (if it was, it'd be
>>> updated; if not, it'd be ignored).
>>>
>>> The issue was, I'm having to do a 'select lastModified from table
>>> where id = ?' query for every record, in order to determine whether db
>>> lastModified < api lastModified. I was wondering if there was a way to
>>> avoid that.
>>>
>>> If I use 'USING TIMESTAMP', would subsequent updates where
>>> lastModified is a value that was previously used still create that
>>> overhead, or would they be ignored?
>>>
>>> E.g. if I issued an update where TIMESTAMP is X, then 1 hour later I
>>> issued another update where TIMESTAMP is still X, would that 2nd
>>> update essentially get ignored, or would it cause any overhead?
>>>
>>> On Wed, May 13, 2015 at 5:02 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> USING TIMESTAMP doesn't avoid compaction overhead.
>>>
>>> When you modify data, the value is stored along with a timestamp
>>> indicating the timestamp of the value.
>>>
>>> Assume timestamp T1 < T2 and you stored value V with timestamp T2.
>>> Then you store V' with timestamp T1.
>>>
>>> Now you have two values of V in the DB: <V,T2>, <V',T1>.
>>>
>>> When you read the value of V from the DB you read both <V,T2> and
>>> <V',T1>; Cassandra resolves the conflict by comparing the timestamps
>>> and returns V.
>>>
>>> Compaction will later take care of removing <V',T1> from the DB.
>>>
>>> I don't understand the ETL use case and its relevance here. Can you
>>> provide more details?
>>>
>>> UPDATE in Cassandra updates specific rows. All of them are updated;
>>> nothing is ignored.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 2:43 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> It's rare for an existing record to have changes, but the ETL job runs
>>> every hour, so it will send updates each time, regardless of whether
>>> there were changes or not.
>>>
>>> (I'm assuming that USING TIMESTAMP here will avoid the compaction
>>> overhead, since it will cause no updates to apply unless the timestamp
>>> is actually > the last update timestamp?)
>>>
>>> Also, is there a way to get the number of rows which were updated /
>>> ignored?
>>>
>>> On Wed, May 13, 2015 at 4:37 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> The cost of issuing an UPDATE that won't update anything is compaction
>>> overhead. Since you stated it's rare for rows to be updated, the
>>> overhead should be negligible.
>>>
>>> The easiest way to convert a milliseconds timestamp long value to
>>> microseconds is to multiply by 1000.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 2:15 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Updating only modified records (where lastModified <
>>> current date)
>>>
>>> Would TimeUnit.MILLISECONDS.toMicros(myDate.getTime()) work for
>>> producing the microsecond timestamp?
>>>
>>> On Wed, May 13, 2015 at 4:09 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>> If specifying 'using' timestamp, the docs say to provide microseconds,
>>> but where are these microseconds obtained from? I have regular
>>> java.util.Date objects; I can get the time in milliseconds (i.e. the
>>> unix timestamp). How would I convert that to microseconds?
>>>
>>> On Wed, May 13, 2015 at 3:56 PM, Ali Akhtar <ali.rac...@gmail.com>
>>> wrote:
>>>
>>> Thanks Oded, that's interesting. I didn't know of that option.
>>>
>>> If updates don't create tombstones (and I'm already taking pains to
>>> ensure no nulls are present in queries), then is there no cost to just
>>> submitting an update for everything, regardless of whether
>>> lastModified has changed?
>>>
>>> Thanks.
>>>
>>> On Wed, May 13, 2015 at 3:38 PM, Peer, Oded <oded.p...@rsa.com> wrote:
>>>
>>> You can use the "last modified" value as the TIMESTAMP for your UPDATE
>>> operation.
>>>
>>> This way the values will only be updated if lastModified date > the
>>> lastModified you have in the DB.
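>>>
>>> For example, schematically (untested; substitute your own table and
>>> column names, and note that write timestamps are in microseconds while
>>> java.util.Date gives milliseconds):
>>>
>>> import java.util.Date;
>>> import com.datastax.driver.core.Cluster;
>>> import com.datastax.driver.core.PreparedStatement;
>>> import com.datastax.driver.core.Session;
>>>
>>> public class TimestampedUpsert {
>>>     public static void main(String[] args) {
>>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>>         Session session = cluster.connect("mykeyspace");
>>>
>>>         Date lastModified = new Date();              // from the API record
>>>         long micros = lastModified.getTime() * 1000; // millis -> micros
>>>
>>>         // No read-before-write: if the DB already holds a copy written
>>>         // with a newer timestamp, this write loses the timestamp
>>>         // comparison on read, and compaction later discards it.
>>>         // (If your driver version doesn't accept a bind marker after
>>>         // USING TIMESTAMP, inline the value into the statement instead.)
>>>         PreparedStatement ps = session.prepare(
>>>             "UPDATE mytable USING TIMESTAMP ? "
>>>           + "SET payload = ?, last_modified = ? WHERE id = ?");
>>>         session.execute(ps.bind(micros, "payload", lastModified, "id-1"));
>>>
>>>         cluster.close();
>>>     }
>>> }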
>>>
>>> Updates to values don't create tombstones. Only deletes (either by
>>> executing a delete, inserting a null value or setting a TTL) create
>>> tombstones.
>>>
>>> From: Ali Akhtar [mailto:ali.rac...@gmail.com]
>>> Sent: Wednesday, May 13, 2015 1:27 PM
>>> To: user@cassandra.apache.org
>>> Subject: Updating only modified records (where lastModified < current
>>> date)
>>>
>>> I'm running some ETL jobs, where the pattern is the following:
>>>
>>> 1- Get some records from an external API.
>>>
>>> 2- For each record, see if its lastModified date > the lastModified I
>>> have in the db (or if I don't have that record in the db at all).
>>>
>>> 3- If lastModified is not newer than dbLastModified, the item wasn't
>>> changed; ignore it. Otherwise, run an update query and update that
>>> record.
>>>
>>> (It is rare for existing records to get updated, so I'm not that
>>> concerned about tombstones.)
>>>
>>> The problem, however, is that since I have to query each record's
>>> lastModified one at a time, that adds a major bottleneck to my job.
>>>
>>> E.g. if I have 6k records, I have to run a total of 6k 'select
>>> lastModified from myTable where id = ?' queries.
>>>
>>> Is there a better way? Am I doing anything wrong? Any suggestions
>>> would be appreciated.
>>>
>>> Thanks.
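>>>
>>> P.S. In code, the current flow is roughly this (simplified sketch,
>>> made-up names, 2.x Java driver):
>>>
>>> import java.util.Collections;
>>> import java.util.Date;
>>> import java.util.List;
>>> import com.datastax.driver.core.Cluster;
>>> import com.datastax.driver.core.PreparedStatement;
>>> import com.datastax.driver.core.Row;
>>> import com.datastax.driver.core.Session;
>>>
>>> public class CurrentEtlFlow {
>>>     static class ApiRecord {
>>>         String id;
>>>         String payload;
>>>         Date lastModified;
>>>     }
>>>
>>>     static List<ApiRecord> fetchFromApi() {
>>>         return Collections.emptyList();  // stub for the external API call
>>>     }
>>>
>>>     public static void main(String[] args) {
>>>         Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
>>>         Session session = cluster.connect("mykeyspace");
>>>
>>>         PreparedStatement read = session.prepare(
>>>             "SELECT lastModified FROM myTable WHERE id = ?");
>>>         PreparedStatement write = session.prepare(
>>>             "UPDATE myTable SET payload = ?, lastModified = ? WHERE id = ?");
>>>
>>>         // One synchronous read per record -- this is the bottleneck.
>>>         for (ApiRecord r : fetchFromApi()) {
>>>             Row row = session.execute(read.bind(r.id)).one();
>>>             Date dbLastModified = (row == null) ? null : row.getDate("lastModified");
>>>             if (dbLastModified == null || r.lastModified.after(dbLastModified)) {
>>>                 session.execute(write.bind(r.payload, r.lastModified, r.id));
>>>             }
>>>         }
>>>         cluster.close();
>>>     }
>>> }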