Hi Peter,

At the moment I have a pipeline based on Flink that writes ORC files. These ORC files can be read from Hive through external tables, and then a MERGE statement (triggered by Oozie) pushes the data into tables managed by Hive (transactional tables => ORC). The Hive version is 2.1, because that is the one provided by HDP 2.6.5.

We've developed a system that writes the Hive delta files for the managed tables directly from Flink. The current streaming APIs for Hive 2 are not suitable for our needs, and we cannot use the new Hive 3 streaming API yet. This system uses Flink state to store the Hive metadata (originalTransaction, bucket, rowId, ...); a simplified sketch of that part follows below.

Thanks for your reply, because yes, when the files are ordered by the (originalTransaction, bucket, rowId) triple, it works! I just have to use 1 transaction instead of 2 at the moment and it will be OK.
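To make the state part concrete, here is a minimal sketch of the idea, not our actual code: all the class and field names (ChangeEvent, AcidRowId, TrackAcidRowId, ...) are made up for illustration, and the transaction/rowId assignment is stubbed out.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

/** Simplified change event on a business key ("cle"). */
class ChangeEvent {
    long cle;
    boolean delete;
}

/** The ACID identity Hive assigned to a row when it was inserted. */
class AcidRowId {
    long originalTransaction;
    int bucket;
    long rowId;
}

/** What we hand to the ORC delta writer downstream. */
class AcidOperation {
    int operation;   // 0 = INSERT, 2 = DELETE, as in the dumps below
    AcidRowId id;
    ChangeEvent row; // null for a DELETE
}

/** Keyed by "cle": remembers the row identity of the last INSERT so that a
 *  later DELETE can tombstone exactly that (originalTransaction, bucket, rowId). */
public class TrackAcidRowId extends KeyedProcessFunction<Long, ChangeEvent, AcidOperation> {

    private transient ValueState<AcidRowId> lastRowId;

    @Override
    public void open(Configuration parameters) {
        lastRowId = getRuntimeContext().getState(
                new ValueStateDescriptor<>("hive-acid-row-id", AcidRowId.class));
    }

    @Override
    public void processElement(ChangeEvent event, Context ctx, Collector<AcidOperation> out)
            throws Exception {
        AcidOperation op = new AcidOperation();
        if (event.delete) {
            // The delete record must carry the stored identity of the row it removes.
            op.operation = 2;
            op.id = lastRowId.value();
            op.row = null;
            lastRowId.clear();
        } else {
            // A fresh insert gets a new identity, which we remember in state.
            op.operation = 0;
            op.id = newIdentityFor(event);
            op.row = event;
            lastRowId.update(op.id);
        }
        out.collect(op);
    }

    private AcidRowId newIdentityFor(ChangeEvent event) {
        // Placeholder: in the real job the transaction id comes from the
        // metastore txn APIs and rowIds are a per-bucket counter.
        return new AcidRowId();
    }
}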
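And for the ordering: a minimal sketch of the sort key we now enforce per bucket file before handing records to the ORC writer (again, the names are made up for illustration). We apply it once per bucket file, just before the write.

import java.util.Comparator;
import java.util.List;

/** Sort key for the records inside one ACID delta bucket file. */
final class AcidRecordId {
    final long originalTransaction;
    final int bucket;
    final long rowId;

    AcidRecordId(long originalTransaction, int bucket, long rowId) {
        this.originalTransaction = originalTransaction;
        this.bucket = bucket;
        this.rowId = rowId;
    }

    // Hive reads delta files with a sorted merger, so every file must be
    // ordered by (originalTransaction, bucket, rowId). Our delta_0011369
    // file below (rowId 1 written before rowId 0) broke that invariant,
    // and that is what produced the duplicate rows.
    static final Comparator<AcidRecordId> ACID_ORDER =
            Comparator.comparingLong((AcidRecordId r) -> r.originalTransaction)
                      .thenComparingInt(r -> r.bucket)
                      .thenComparingLong(r -> r.rowId);

    static void sortBucketFile(List<AcidRecordId> records) {
        records.sort(ACID_ORDER);
    }
}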
Thanks,
David

On 2019/11/29 11:18:05, Peter Vary <pv...@cloudera.com> wrote:
> Hi David,
>
> Not entirely sure what you are doing here :), my guess is that you are
> trying to write ACID tables outside of Hive. Am I right? What is the exact
> use case? There might be better solutions out there than writing the files
> by hand.
>
> As for your question below: yes, the files should be ordered by the
> originalTransaction, bucket, rowId triple, otherwise you will get wrong
> results.
>
> Thanks,
> Peter
>
> > On Nov 19, 2019, at 13:30, David Morin <morin.david....@gmail.com> wrote:
> >
> > Here are more details about the ORC content and the duplicate rows we get:
> >
> > /delta_0011365_0011365_0000/bucket_00003
> >
> > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
> > {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
> >
> > /delta_0011368_0011368_0000/bucket_00003
> >
> > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
> > {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
> >
> > /delta_0011369_0011369_0000/bucket_00003
> >
> > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
> > {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
> >
> > +-------------------------------------------------+-------+--+
> > |                     row__id                     |  cle  |
> > +-------------------------------------------------+-------+--+
> > | {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209  |
> > | {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211  |
> > | {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210  |
> > | {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214  |
> > | {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215  |
> > | {"transactionid":11365,"bucketid":3,"rowid":0}  | 5218  |
> > | {"transactionid":11365,"bucketid":3,"rowid":1}  | 5216  |
> > | {"transactionid":11369,"bucketid":3,"rowid":1}  | 5216  |
> > | {"transactionid":11369,"bucketid":3,"rowid":0}  | 5218  |
> > | {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217  |
> > | {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213  |
> > | {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212  |
> > +-------------------------------------------------+-------+--+
> >
> > As you can see, we have duplicate rows for "cle" values 5216 and 5218.
> > Do we have to keep the rowIds ordered? That is the only difference I have
> > noticed, based on some tests with beeline.
> >
> > Thanks
> >
> > On Tue, Nov 19, 2019 at 00:18, David Morin <morin.david....@gmail.com> wrote:
> > Hello,
> >
> > I'm trying to understand the purpose of the rowId column inside an ORC
> > delta file:
> > {"transactionid":11359,"bucketid":5,"rowid":0}
> > ORC view:
> > {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...}
> > I use HDP 2.6 => Hive 2.
> >
> > I want to be idempotent with INSERT / DELETE / INSERT: do we have to keep
> > the same rowId?
> > It seems that when the rowId is changed during the second INSERT, I get a
> > duplicate row.
> > For me, I can create a new rowId for the new transaction during the
> > second INSERT, but that seems to generate duplicate records.
> >
> > Regards,
> > David