Hi Peter,

At the moment I have a pipeline based on Flink that writes ORC files. These ORC 
files can be read from Hive thanks to external tables, and then a MERGE 
statement (triggered by Oozie) pushes the data into tables managed by Hive 
(transactional tables => ORC). The Hive version is 2.1 because this is the one 
provided by HDP 2.6.5.
We've developed a system that writes Hive delta files for the managed tables 
directly from Flink.
The current streaming APIs for Hive 2 are not suitable for our needs and we 
cannot use the new Hive 3 streaming API yet. This system uses the Flink state 
to store the Hive metadata (originalTransaction, bucket, rowId, ...).
Thanks for your reply, because yes, when the files are ordered by 
originalTransaction, bucket, rowId,
it works! I just have to use 1 transaction instead of 2 at the moment and it 
will be ok.
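To make the fix concrete, here is a minimal sketch (not our actual Flink code, and the record dicts are simplified stand-ins for the ORC ACID events shown below) of sorting the events of one delta file by the (originalTransaction, bucket, rowId) triple before writing, which is the ordering Hive's merge-on-read relies on to match delete events against the rows they delete:

```python
# Hypothetical in-memory representation of ACID events for one
# delta file / bucket; field names mirror the ORC event schema
# (operation, originalTransaction, bucket, rowId, row).
records = [
    {"operation": 0, "originalTransaction": 11369, "bucket": 3,
     "rowId": 1, "row": {"cle": 5216}},
    {"operation": 0, "originalTransaction": 11369, "bucket": 3,
     "rowId": 0, "row": {"cle": 5218}},
]

# Sort by the (originalTransaction, bucket, rowId) triple: this is
# the invariant Peter describes; without it the reader produces
# wrong (here, duplicated) results.
records.sort(key=lambda r: (r["originalTransaction"],
                            r["bucket"],
                            r["rowId"]))

for r in records:
    print(r["originalTransaction"], r["bucket"], r["rowId"])
```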

Thanks
David

On 2019/11/29 11:18:05, Peter Vary <pv...@cloudera.com> wrote: 
> Hi David,
> 
> Not entirely sure what you are doing here :), my guess is that you are trying 
> to write ACID tables outside of hive. Am I right? What is the exact use-case? 
> There might be better solutions out there than writing the files by hand.
> 
> As for your question below: yes, the files should be ordered by the 
> originalTransaction, bucket, rowId triple, otherwise you will get wrong 
> results.
> 
> Thanks,
> Peter
> 
> > On Nov 19, 2019, at 13:30, David Morin <morin.david....@gmail.com> wrote:
> > 
> > here after more details about ORC content and the fact we have duplicate 
> > rows:
> > 
> > /delta_0011365_0011365_0000/bucket_00003
> > 
> > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
> > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
> > 
> > 
> > /delta_0011368_0011368_0000/bucket_00003
> > 
> > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
> > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
> > 
> > /delta_0011369_0011369_0000/bucket_00003
> > 
> > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
> > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
> > 
> > +-------------------------------------------------+-------+--+
> > |                     row__id                     |  cle  |
> > +-------------------------------------------------+-------+--+
> > | {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209  |
> > | {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211  |
> > | {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210  |
> > | {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214  |
> > | {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215  |
> > | {"transactionid":11365,"bucketid":3,"rowid":0}  | 5218  |
> > | {"transactionid":11365,"bucketid":3,"rowid":1}  | 5216  |
> > | {"transactionid":11369,"bucketid":3,"rowid":1}  | 5216  |
> > | {"transactionid":11369,"bucketid":3,"rowid":0}  | 5218  |
> > | {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217  |
> > | {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213  |
> > | {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212  |
> > +-------------------------------------------------+-------+--+
> > 
> > As you can see, we have duplicate rows for column "cle" values 5216 and 5218.
> > Do we have to keep the rowIds ordered? Because this is the only difference 
> > I have noticed, based on some tests with beeline.
> > 
> > Thanks
> > 
> > 
> > 
> > On Tue, Nov 19, 2019 at 00:18, David Morin <morin.david....@gmail.com 
> > <mailto:morin.david....@gmail.com>> wrote:
> > Hello,
> > 
> > I'm trying to understand the purpose of the rowId column inside an ORC delta 
> > file:
> > {"transactionid":11359,"bucketid":5,"rowid":0}
> > ORC view: 
> > {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...}
> > I use HDP 2.6 => Hive 2.
> > 
> > I want to be idempotent with INSERT / DELETE / INSERT. 
> > Do we have to keep the same rowId?
> > It seems that when the rowId is changed during the second INSERT, I get a 
> > duplicate row.
> > I could create a new rowId for the new transaction during the second 
> > INSERT, but that seems to generate duplicate records.
> > 
> > Regards,
> > David
> > 
> > 
> > 
> 
> 
