Re: Handling delta

2020-07-19 Thread Vinoth Chandar
Thanks, everyone, for helping out Prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash 
wrote:

> [quoted thread trimmed; see the messages below]


Re: Handling delta

2020-07-16 Thread Sivaprakash
Great !!

Got it working !!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class':
'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.
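[Editor's note: for completeness, a minimal sketch of how the two options above might be wired up in PySpark. Only the recordkey and keygenerator options come from this thread; the table name, operation, and save path are hypothetical placeholders.]

```python
# Hudi write options for a composite record key on COL1 and COL2.
# Only the recordkey/keygenerator entries come from this thread; the
# table name, operation, and path below are hypothetical.
hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "COL1,COL2",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.operation": "upsert",
}

# With a live SparkSession this would be applied roughly as:
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("/path/to/table"))
```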

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman  wrote:

> [quoted thread trimmed; see the messages below]


-- 
- Prakash.


Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash,
To specify multiple keys in comma-separated notation, you must also set the
KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName. Please
see the description here:
https://hudi.apache.org/docs/writing_data.html#datasource-writer

Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
hoodie.datasource.write.recordkey.field configuration.

Thanks,

Adam Feldman

On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash 
wrote:

> [quoted thread trimmed; see the messages below]


Re: Handling delta

2020-07-16 Thread Sivaprakash
Looks like this property does the trick

Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey.
Actual value will be obtained by invoking .toString() on the field value.
Nested fields can be specified using the dot notation eg: a.b.c

However, I couldn't provide more than one column like this: COL1.COL2

'hoodie.datasource.write.recordkey.field': 'COL1.COL2'

Is anything wrong with the syntax? (I tried with a comma as well.)


On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash 
wrote:

> [quoted thread trimmed; see the messages below]


-- 
- Prakash.


Re: Handling delta

2020-07-16 Thread Sivaprakash
Hello Balaji

Thank you for your info !!

I tried those options, but here is what I find (I'm trying to understand how
Hudi internally manages its files).

First write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'TRE', 'TRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

Commit time for all the records: 20200716212533

Second write:

('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')

(There is only one changed record in my new dataset; the other two records
are the same as in the first write. But after a snapshot/incremental read, I
see that the commit time is updated for all 3 records.)

Commit time for all the records: 20200716214544


   - Does this mean that Hudi re-writes all 3 records? I thought it would
   write only the 2nd record.
   - I'm trying to understand the storage volume efficiency here.
   - Does some configuration have to be enabled to fix this?

Configuration that I use:

   - COPY_ON_WRITE, Append, Upsert
   - The first column (NR001) is configured as
   *hoodie.datasource.write.recordkey.field*
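[Editor's note: the toy model below, plain Python rather than anything from Hudi itself, illustrates upsert semantics keyed on the record key. It shows why the table stays at 3 records even though every row in the rewritten file group picks up the new commit time.]

```python
# Toy model of upsert semantics keyed on the record key (first column).
# NOT Hudi's implementation -- just an illustration: a rewrite replaces
# rows by key, so the record count stays constant, but every surviving
# row carries the latest commit time.
def upsert(table, batch, commit_time):
    for row in batch:
        key = row[0]
        table[key] = (commit_time, row)  # replace or insert by key
    return table

table = {}
upsert(table,
       [("NR001", "YXXXTRE"), ("NR002", "TRE"), ("NR003", "YZZZTRE")],
       "20200716212533")
upsert(table,
       [("NR001", "YXXXTRE"), ("NR002", "ZYYYTRE"), ("NR003", "YZZZTRE")],
       "20200716214544")
# Still 3 records; only NR002's value changed, but all rows in the
# rewritten batch now show the new commit time.
```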




On Thu, Jul 16, 2020 at 6:10 PM Balaji Varadarajan
 wrote:

> [quoted thread trimmed; see the messages below]



-- 
- Prakash.


Re: Handling delta

2020-07-16 Thread Balaji Varadarajan
 Hi Sivaprakash,
Uniqueness of records is determined by the record key you specify to Hudi.
Hudi supports filtering out existing records (by record key). By default,
it would upsert all incoming records.
Please look at
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
for information on how to dedupe records based on record key.
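[Editor's note: to make the dedupe options concrete, a sketch of the configs typically involved. The field names NR_ID and ts are hypothetical placeholders; check the config reference for your Hudi version.]

```python
# Options involved in deduping incoming records by record key.
# Field names NR_ID and ts are hypothetical placeholders.
dedupe_options = {
    "hoodie.datasource.write.recordkey.field": "NR_ID",
    # When two incoming rows share a key, the row with the larger
    # precombine value wins.
    "hoodie.datasource.write.precombine.field": "ts",
    # Dedupe the batch before insert (defaults to false) and before
    # upsert (defaults to true).
    "hoodie.combine.before.insert": "true",
    "hoodie.combine.before.upsert": "true",
}
```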

Balaji.V
On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash 
 wrote:  
 
 This might be a basic question - I'm experimenting with Hudi (Pyspark). I
have used Insert/Upsert options to write delta into my data lake. However,
one is not clear to me

Step 1:- I write 50 records
Step 2:- Im writing 50 records out of which only *10 records have been
changed* (I'm using upsert mode & tried with MERGE_ON_READ also
COPY_ON_WRITE)
Step 3: I was expecting only 10 records will be written but it writes whole
50 records is this a normal behaviour? Which means do I need to determine
the delta myself and write them alone?

Am I missing something?
  

Re: Handling delta

2020-07-16 Thread Sivaprakash
Yes, I'm updating the 10 records that I mentioned from Step 1, but I
re-write the whole dataset the second time as well. I see that commit_time
is getting updated for all 50 records (which I feel is normal), but I'm not
sure how to see/prove to myself that the data is not growing (it should
stay at 50 records, not grow to 100).

On Thu, Jul 16, 2020 at 4:01 PM Allen Underwood
 wrote:

> [quoted thread trimmed; see the messages below]


-- 
- Prakash.


Re: Handling delta

2020-07-16 Thread Adam Feldman
Hi Sivaprakash,
Not an expert here either, but for your second question: yes, I believe
that when writing a delta to the table you must identify the actual delta
yourself and only write the new/changed/removed records. I guess we could
put in a request for Hudi to take care of this, but two possible issues
would be Hudi knowing which of the columns in the table matter for the diff
(or whether to consider all columns), and that this may add significant
overhead.
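[Editor's note: one minimal way to identify new/changed records yourself, sketched with plain Python tuples. In a real pipeline this would typically be a DataFrame join/except, and removed records would need separate handling.]

```python
# Compute the delta yourself by comparing the new batch against the
# previous snapshot, keyed on the record key (first column).
def compute_delta(previous, current):
    """Return only rows that are new or changed; deletes not handled."""
    prev_by_key = {row[0]: row for row in previous}
    return [row for row in current if prev_by_key.get(row[0]) != row]

prev = [("NR001", "YXXXTRE"), ("NR002", "TRE"), ("NR003", "YZZZTRE")]
curr = [("NR001", "YXXXTRE"), ("NR002", "ZYYYTRE"), ("NR003", "YZZZTRE")]
delta = compute_delta(prev, curr)  # only NR002 changed
```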

Thanks,

On Thu, Jul 16, 2020, 10:01 Allen Underwood
 wrote:

> [quoted thread trimmed; see the messages below]


Re: Handling delta

2020-07-16 Thread Allen Underwood
Hi Sivaprakash,

So I'm by no means an expert on this, but I think you might find what
you're looking for here:
https://hudi.apache.org/docs/concepts.html

I'm not sure I fully understand Step 2 you mentioned - I'm writing 50
records out of which only 10 records have been changed - does that mean
that you updated 10 records from step 1?  Or you're updating some of the
other 40 records from step 2?

Either way, I guess the key point is that all deltas will be written; it's
after those records are written to disk that they are consolidated during
the COMPACTION phase. I *BELIEVE* this is how it works.
Take a look at COMPACTION under the timeline section here:
https://hudi.apache.org/docs/concepts.html#timeline

Hope that helps a bit.

Allen

On Thu, Jul 16, 2020 at 7:23 AM Sivaprakash 
wrote:

> [quoted original question trimmed; see the previous message]


-- 
*Allen Underwood*
Principal Software Engineer
Broadcom | Symantec Enterprise Division
*Mobile*: 404.808.5926

