Thanks, David,

Hope that Hive 3 streaming will soon help you avoid this kind of headache :)

Peter
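In case it is useful once you can move: with the Hive 3 streaming API (org.apache.hive.streaming), the connection manages the transactions, delta files and rowIds for you, so there is no manual row__id bookkeeping. A minimal sketch, with a made-up database, table and record purely for illustration:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.streaming.HiveStreamingConnection;
import org.apache.hive.streaming.StrictDelimitedInputWriter;

public class StreamingSketch {
    public static void main(String[] args) throws Exception {
        // Writer that parses delimited text records into table columns.
        StrictDelimitedInputWriter writer = StrictDelimitedInputWriter.newBuilder()
            .withFieldDelimiter(',')
            .build();

        // The connection handles ACID transactions, delta file layout and
        // rowId assignment internally.
        HiveStreamingConnection connection = HiveStreamingConnection.newBuilder()
            .withDatabase("mydb")              // hypothetical database
            .withTable("my_acid_table")        // hypothetical transactional table
            .withAgentInfo("flink-writer")     // free-form client identifier
            .withRecordWriter(writer)
            .withHiveConf(new HiveConf())
            .connect();

        connection.beginTransaction();
        connection.write("5218,2019-11-19".getBytes()); // one delimited record per call
        connection.commitTransaction();
        connection.close();
    }
}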
> On Dec 1, 2019, at 17:57, David Morin <morin.david....@gmail.com> wrote:
> 
> Hi Peter,
> 
> At the moment I have a pipeline based on Flink that writes ORC files. These
> ORC files can be read from Hive thanks to external tables, and then a merge
> statement (triggered by Oozie) pushes this data into tables managed by Hive
> (transactional tables => ORC). The Hive version is 2.1 because this is the
> one provided by HDP 2.6.5.
> We've developed a system that writes Hive delta files for the managed
> tables directly from Flink.
> The current streaming APIs for Hive 2 are not suitable for our needs, and
> we cannot use the new Hive 3 streaming API yet. This system uses the Flink
> state to store the Hive metadata (originalTransaction, bucket, rowId, ...).
> Thanks for your reply, because yes, when the files are ordered by
> originalTransaction, bucket, rowId, it works! I just have to use 1
> transaction instead of 2 at the moment and it will be OK.
> 
> Thanks
> David
> 
> On 2019/11/29 11:18:05, Peter Vary <pv...@cloudera.com> wrote:
>> Hi David,
>> 
>> Not entirely sure what you are doing here :), my guess is that you are
>> trying to write ACID tables outside of Hive. Am I right? What is the exact
>> use case? There might be better solutions out there than writing the files
>> by hand.
>> 
>> As for your question below: yes, the files should be ordered by the
>> (originalTransaction, bucket, rowId) triple, otherwise you will get wrong
>> results.
>> 
>> Thanks,
>> Peter
>> 
>>> On Nov 19, 2019, at 13:30, David Morin <morin.david....@gmail.com> wrote:
>>> 
>>> Here are more details about the ORC content and the fact that we have
>>> duplicate rows:
>>> 
>>> /delta_0011365_0011365_0000/bucket_00003
>>> 
>>> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11365,"row":{"TS":1574156027915254212,"cle":5218,...}}
>>> {"operation":0,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11365,"row":{"TS":1574156027915075038,"cle":5216,...}}
>>> 
>>> /delta_0011368_0011368_0000/bucket_00003
>>> 
>>> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":1,"currentTransaction":11368,"row":null}
>>> {"operation":2,"originalTransaction":11365,"bucket":3,"rowId":0,"currentTransaction":11368,"row":null}
>>> 
>>> /delta_0011369_0011369_0000/bucket_00003
>>> 
>>> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":1,"currentTransaction":11369,"row":{"TS":1574157407855174144,"cle":5216,...}}
>>> {"operation":0,"originalTransaction":11369,"bucket":3,"rowId":0,"currentTransaction":11369,"row":{"TS":1574157407855265906,"cle":5218,...}}
>>> 
>>> +-------------------------------------------------+-------+--+
>>> |                     row__id                     |  cle  |
>>> +-------------------------------------------------+-------+--+
>>> | {"transactionid":11367,"bucketid":0,"rowid":0}  | 5209  |
>>> | {"transactionid":11369,"bucketid":0,"rowid":0}  | 5211  |
>>> | {"transactionid":11369,"bucketid":1,"rowid":0}  | 5210  |
>>> | {"transactionid":11369,"bucketid":2,"rowid":0}  | 5214  |
>>> | {"transactionid":11369,"bucketid":2,"rowid":1}  | 5215  |
>>> | {"transactionid":11365,"bucketid":3,"rowid":0}  | 5218  |
>>> | {"transactionid":11365,"bucketid":3,"rowid":1}  | 5216  |
>>> | {"transactionid":11369,"bucketid":3,"rowid":1}  | 5216  |
>>> | {"transactionid":11369,"bucketid":3,"rowid":0}  | 5218  |
>>> | {"transactionid":11369,"bucketid":4,"rowid":0}  | 5217  |
>>> | {"transactionid":11369,"bucketid":4,"rowid":1}  | 5213  |
>>> | {"transactionid":11369,"bucketid":7,"rowid":0}  | 5212  |
>>> +-------------------------------------------------+-------+--+
>>> 
>>> As you can see, we have duplicate rows for column "cle" values 5216 and
>>> 5218.
>>> Do we have to keep the rowIds ordered? Because that is the only
>>> difference I have noticed based on some tests with beeline.
>>> 
>>> Thanks
>>> 
>>> On Tue, Nov 19, 2019 at 00:18, David Morin <morin.david....@gmail.com
>>> <mailto:morin.david....@gmail.com>> wrote:
>>> Hello,
>>> 
>>> I'm trying to understand the purpose of the rowId column inside an ORC
>>> delta file:
>>> {"transactionid":11359,"bucketid":5,"rowid":0}
>>> ORC view:
>>> {"operation":0,"originalTransaction":11359,"bucket":5,"rowId":0,"currentTransaction":11359,"row":...}
>>> I use HDP 2.6 => Hive 2.
>>> 
>>> If I want to be idempotent with INSERT / DELETE / INSERT, do we have to
>>> keep the same rowId?
>>> It seems that when the rowId is changed during the second INSERT, I get a
>>> duplicate row. I could create a new rowId for the new transaction during
>>> the second INSERT, but that seems to generate duplicate records.
>>> 
>>> Regards,
>>> David
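To make the ordering requirement concrete: within a single delta file, the records must be sorted by the (originalTransaction, bucket, rowId) triple, in that order. A minimal sketch of that sort key, using an illustrative holder class (the names are not Hive internals; Hive itself encapsulates this key in org.apache.hadoop.hive.ql.io.RecordIdentifier, which compares in the same order):

import java.util.Comparator;

public class AcidOrderSketch {
    // Illustrative holder for the ACID key of one delta record.
    static final class AcidKey {
        final long originalTransaction;
        final int bucket;
        final long rowId;

        AcidKey(long originalTransaction, int bucket, long rowId) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
        }
    }

    // Records in a delta file must be sorted by this triple; otherwise
    // readers merging the deltas can return wrong or duplicate rows.
    static final Comparator<AcidKey> ACID_ORDER =
        Comparator.comparingLong((AcidKey k) -> k.originalTransaction)
                  .thenComparingInt(k -> k.bucket)
                  .thenComparingLong(k -> k.rowId);

    public static void main(String[] args) {
        // The two rows from delta_0011365 above, already correctly ordered:
        AcidKey a = new AcidKey(11365, 3, 0);
        AcidKey b = new AcidKey(11365, 3, 1);
        System.out.println(ACID_ORDER.compare(a, b) < 0); // true
    }
}

With this ordering in place, the reader's merge can pair each delete event with the insert it cancels, which appears to be why the unordered rowIds in delta_0011369 above surfaced as duplicate rows.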