Hi Faisal,

Out of curiosity, does the upstream system(s) guarantee that these timestamps 
will be unique, even at the microsecond level? (I’ve seen some systems that 
write a “microsecond precision” which is really just a millisecond value + a 
sub-millisecond value that increments every time it is queried within a certain 
time window and thus guaranteeing uniqueness within a set of values taken from 
the same clock).

Disclaimer on all of the below: I’m not an HBase expert, so I’m just applying 
my knowledge of using other databases to try to help you troubleshoot this 
situation.

Do you have control over the HBase Schema, and if so, is there another 
field/value you could use for transaction id other than timestamp?  According 
to the little I have read, one should avoid using timestamps for row keys 
anyway [1]. That aside, it seems like you have a uniqueness challenge 
truncating your microseconds, so you’ll have to consider one of the following:


  1.  Change the HBase schema to use some other key. If you can,  just derive a 
primary key to use as a transaction id for the Hbase record. You could just use 
the string representation of the microsecond timestamp (assuming you have 
guarantees that is going to be unique in your input data) or run it through 
some hash algorithm and store the digest, which might also improve the HBase 
region distribution when bulk writing. You can still generate a millisecond 
timestamp to write to Hbase, but you can drop the unique requirement for the 
data as you will have some other key. If the upstream system is not 
guaranteeing that the timestamps will be unique, it sounds to me like you 
should probably not be relying on the timestamp as a transaction id at all.
  2.  If you don’t control the target Hbase system and you’re stuck using 
millisecond timestamps as transactions ids, then you could consider modifying 
the data you are inserting so that the timestamps are always unique. This 
may/may not be an option depending on how many transactions per second you are 
dealing with in the input data. If the rate of transactions is low (e.g., much 
less than 1000 per second), then if you get a millisecond collision when 
truncating microseconds you can deal with that by modifying the duplicate 
timestamp (i.e., the second one to get truncated to the same value) to the 
closest unused millisecond value. This adds overhead (you have to keep track of 
milliseconds used within a time window / batch, or have a failure path for data 
flow that tries to insert into Hbase until successful, modifyingthe key each 
time), and again, doesn’t work if you have more than 1000 transactions per 
second as then you will inevitably start skewing the timestamps by quite a bit.

If it were me, I would try hard to use something other than timestamp as your 
row key / transaction id to avoid this altogether… even if the timestamps is 
part of whatever you end up using as that id.

Hope this helps and good luck,
Kevin

[1] http://hbase.apache.org/book.html#rowkey.design

From: Faisal Durrani <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, December 3, 2018 at 22:03
To: "[email protected]" <[email protected]>
Subject: Re: Expression language - Convert ISO 8601 to unixtimestamp

Hi Kevin,

Thank you for your mail and suggestions. I was also able to achieve the same by 
using the below expression

${current_ts:substring(0,19):toDate("yyyy-MM-dd'T'HH:mm:ss","UTC"):toNumber():plus(${current_ts:substring(20,23)})}

So I am taking out the millisecond part of the microsecond time and adding into 
the unix time generated from yyyy-MM-dd'T'HH:mm:ss. While this worked for the 
one of the case i was working with, sadly it failed for another one. As it 
turns out I need to have a microsecond precision so e.g. I have one transaction 
that is coming at current_ts: 2018-11-11T00:17:27.937000 and then I have 
another one coming at current_ts: 2018-11-11T00:17:27.937001 . The Unix 
timestamp converted from the above expression is in millisecond (13 digit epoch 
number) so as i understand it cannot be more precise then this. If it take out 
the microsecond precision the target system which is Hbase in my case considers 
it as the same transaction and over writes the record received at the earlier 
time. Hbase timestamp too can only accept unix time in millisecond (13 digit 
epoch number) precision. So i am lost at this point how to order these 
transactions :(. Please do me know if you have any suggestion on this.

Regards,
Faisal

On Tue, Dec 4, 2018 at 1:49 AM Kevin Doran 
<[email protected]<mailto:[email protected]>> wrote:
Hi Faisal,

It appears whatever is writing these date strings that you have for inputs is 
writing microseconds, not milliseconds,

So when you are using `.SSSSSS`, that is, in this case `937000` microseconds 
being interpreted by the expression language java parser as 937000 milliseconds 
(15 minutes 37 seconds), hence the “slightly off” factor you are experiencing 
for the minutes/seconds of the result.

The difference in hour/day can be explained by timezone. Your input timestring 
must be localtime, but you are treating it as UTC. Specify the correct timezone 
when calling toDate() and that will shift the hours.

So, my suggestion is to run these date strings through and expression that 
first “converts” microseconds to milliseconds via truncation (using substring 
to drop off the trailing “000” is equivalent to dividing by 1000 and dropping 
the remainder/fraction), and then using your expression with the timezone 
specified. Given that your example was off by 9 hours (in addition to the ~15 
minutes), I used UTC+9 time zone and got this expression, which worked for me:

${current_ts:substring(0,${current_ts:length():minus(3)}):toDate("yyyy-MM-dd'T'HH:mm:ss.SSSSSS",
 "AWDT"):toNumber()}

Lastly, depending on how you are using the result of toNumber(), keep in mind 
that some systems expect seconds since epoch (not milliseconds, which 
toNumber() outputs) for a Unix timestamp.

Cheers,
Kevin

From: Faisal Durrani <[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Sunday, December 2, 2018 at 21:59
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Expression language - Convert ISO 8601 to unixtimestamp

Hi Guys,

I am having trouble converting the ISO 8601 to a unixtimestamp . Here is what i 
have tried

current_ts: 2018-11-11T00:17:27.937000

Using updateAttribute, I have configured the below property

${current_ts:toDate("yyyy-MM-dd'T'HH:mm:ss.SSSSSS"):toNumber()}
This gives the Output value : 1541863984000
if converted back from epochcoverter, its giving

Assuming that this timestamp is in milliseconds:

GMT: Saturday, 10 November 2018 15:33:04

Your time zone: Sunday, 11 November 2018 00:33:04 
GMT+09:00<https://www.epochconverter.com/timezones?q=1541863984>

Relative: 22 days ago



I have also tried using ${current_ts:toDate("yyyy-MM-dd'T'HH:mm:ss.SSSSSS", 
"UTC"):toNumber()} but the result is the same and its completely wrong. Kindly 
help finding out correct expression.

Regards,
Faisal

Reply via email to