Thanks David,
Hope that Hive 3 streaming will soon help you avoid these kinds of headaches :)
Peter

> On Dec 1, 2019, at 17:57, David Morin <morin.david....@gmail.com> wrote:
> 
> Hi Peter,
> 
> At the moment I have a pipeline based on Flink to write ORC files. These ORC 
> files can be read from Hive thanks to external tables, and then a merge 
> statement (triggered by Oozie) pushes this data into tables managed by Hive 
> (transactional tables => ORC). The Hive version is 2.1 because this is the one 
> provided by HDP 2.6.5.
> We've developed a system that writes Hive delta files for the managed tables 
> directly from Flink.
> The current streaming APIs for Hive 2 are not suitable for our needs and we 
> cannot use the new Hive 3 streaming API yet. This system uses Flink state 
> to store the Hive metadata (originalTransaction, bucket, rowId, ...).
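> 
> For illustration, the per-row identifier we keep looks roughly like this (a 
> simplified plain-Java sketch, not our actual code; in the real job an 
> instance of this lives in Flink keyed state for every row we have written):
> 
>     import java.util.Objects;
> 
>     // The triple a later DELETE event must reference so that it cancels
>     // exactly the row produced by the original INSERT.
>     public final class AcidRecordId {
> 
>         final long originalTransaction; // transaction that inserted the row
>         final int bucket;               // bucket the row was written to
>         final long rowId;               // position of the row in that delta
> 
>         AcidRecordId(long originalTransaction, int bucket, long rowId) {
>             this.originalTransaction = originalTransaction;
>             this.bucket = bucket;
>             this.rowId = rowId;
>         }
> 
>         @Override
>         public boolean equals(Object o) {
>             if (!(o instanceof AcidRecordId)) return false;
>             AcidRecordId other = (AcidRecordId) o;
>             return originalTransaction == other.originalTransaction
>                     && bucket == other.bucket
>                     && rowId == other.rowId;
>         }
> 
>         @Override
>         public int hashCode() {
>             return Objects.hash(originalTransaction, bucket, rowId);
>         }
>     }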
> Thanks for your reply, because yes: when the files are ordered by 
> (originalTransaction, bucket, rowId), it works! I just have to use one 
> transaction instead of two at the moment and it will be OK.
> 
> Thanks
> David
> 
> On 2019/11/29 11:18:05, Peter Vary <pv...@cloudera.com> wrote: 
>> Hi David,
>> 
>> Not entirely sure what you are doing here :). My guess is that you are 
>> trying to write ACID tables outside of Hive. Am I right? What is the exact 
>> use case? There might be better solutions out there than writing the files 
>> by hand.
>> 
>> As for your question below: yes, the files should be ordered by the 
>> (originalTransaction, bucket, rowId) triple; otherwise you will get wrong 
>> results.
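>> 
>> To illustrate the expected order, a minimal sketch (plain Java; the class 
>> and event layout are just for illustration, not a Hive API):
>> 
>>     import java.util.Arrays;
>>     import java.util.Comparator;
>> 
>>     public class DeltaOrder {
>> 
>>         // Each event is {originalTransaction, bucket, rowId}; delta files
>>         // must be sorted by that triple, in that order.
>>         static final Comparator<long[]> BY_TRIPLE =
>>                 Comparator.<long[]>comparingLong(e -> e[0]) // originalTransaction
>>                           .thenComparingLong(e -> e[1])     // bucket
>>                           .thenComparingLong(e -> e[2]);    // rowId
>> 
>>         public static void main(String[] args) {
>>             // Two events written out of order, as in your delete delta.
>>             long[][] events = { {11365, 3, 1}, {11365, 3, 0} };
>>             Arrays.sort(events, BY_TRIPLE);
>>             for (long[] e : events) {
>>                 System.out.println(Arrays.toString(e)); // rowId 0, then 1
>>             }
>>         }
>>     }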
>> 
>> Thanks,
>> Peter
>> 
>>> On Nov 19, 2019, at 13:30, David Morin <morin.david....@gmail.com> wrote:
>>> 
>>> Here are more details about the ORC content and the duplicate rows we are 
>>> seeing:
>>> 
>>> /delta_0011365_0011365_0000/bucket_00003
>>> 
>>> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
>>> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
>>> 
>>> 
>>> /delta_0011368_0011368_0000/bucket_00003
>>> 
>>> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
>>> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
>>> 
>>> /delta_0011369_0011369_0000/bucket_00003
>>> 
>>> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
>>> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
>>> 
>>> +-------------------------------------------------+-------+--+
>>> |                     row__id                     |  cle  |
>>> +-------------------------------------------------+-------+--+
>>> | {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209  |
>>> | {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211  |
>>> | {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210  |
>>> | {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214  |
>>> | {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215  |
>>> | {"transactionid":11365,"bucketid":3,"rowid":0}  | 5218  |
>>> | {"transactionid":11365,"bucketid":3,"rowid":1}  | 5216  |
>>> | {"transactionid":11369,"bucketid":3,"rowid":1}  | 5216  |
>>> | {"transactionid":11369,"bucketid":3,"rowid":0}  | 5218  |
>>> | {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217  |
>>> | {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213  |
>>> | {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212  |
>>> +-------------------------------------------------+-------+--+
>>> 
>>> As you can see, we have duplicate rows for "cle" values 5216 and 5218.
>>> Do we have to keep the rowIds ordered? Based on my tests with beeline, the 
>>> ordering is the only difference I have noticed: in delta_0011368 and 
>>> delta_0011369 above, the rowId 1 event comes before the rowId 0 event.
>>> 
>>> Thanks
>>> 
>>> 
>>> 
>>> On Tue, Nov 19, 2019 at 00:18, David Morin <morin.david....@gmail.com> 
>>> wrote:
>>> Hello,
>>> 
>>> I'm trying to understand the purpose of the rowId column inside an ORC delta 
>>> file:
>>> {"transactionid":11359,"bucketid":5,"rowid":0}
>>> ORC view: 
>>> {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...}
>>> I use HDP 2.6 => Hive 2.
>>> 
>>> If I want to be idempotent with INSERT / DELETE / INSERT, do we have to keep 
>>> the same rowId?
>>> It seems that when the rowId is changed during the second INSERT, I get a 
>>> duplicate row.
>>> As I understand it, I can create a new rowId for the new transaction during 
>>> the second INSERT, but that seems to generate duplicate records.
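>>> 
>>> To make the question concrete, here is my mental model of how the reader 
>>> resolves the events (a toy simulation, not Hive's actual merge logic; the 
>>> second transaction id is made up):
>>> 
>>>     import java.util.LinkedHashMap;
>>>     import java.util.Map;
>>> 
>>>     public class MergeOnReadToy {
>>> 
>>>         public static void main(String[] args) {
>>>             // key = "originalTransaction/bucket/rowId" -> row payload
>>>             Map<String, String> visible = new LinkedHashMap<>();
>>> 
>>>             // 1st INSERT: the row gets a fresh identifier.
>>>             visible.put("11359/5/0", "first insert");
>>> 
>>>             // DELETE: must reference the ORIGINAL identifier to cancel it.
>>>             visible.remove("11359/5/0");
>>> 
>>>             // 2nd INSERT: a new transaction, so a new identifier is fine...
>>>             visible.put("11360/5/0", "second insert");
>>> 
>>>             // ...and exactly one row remains. A duplicate would mean the
>>>             // DELETE referenced an identifier that never matched the row
>>>             // written by the first INSERT.
>>>             System.out.println(visible); // {11360/5/0=second insert}
>>>         }
>>>     }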
>>> 
>>> Regards,
>>> David
>>> 
>>> 
>>> 
>> 
>> 
