Re: Handling delta
Thanks everyone for helping out Prakash!

On Thu, Jul 16, 2020 at 10:24 AM Sivaprakash wrote:
> Great !! Got it working !!
>
> 'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
> 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
>
> Thank you.
Re: Handling delta
Great !! Got it working !!

'hoodie.datasource.write.recordkey.field': 'COL1,COL2',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',

Thank you.

On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman wrote:
> Hi Sivaprakash,
>
> To specify multiple keys in comma-separated notation, you must also set
> KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName.
> Please see the description here:
> https://hudi.apache.org/docs/writing_data.html#datasource-writer
>
> Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
> hoodie.datasource.write.recordkey.field configuration.
>
> Thanks,
>
> Adam Feldman
Re: Handling delta
Hi Sivaprakash,

To specify multiple keys in comma-separated notation, you must also set
KEYGENERATOR_CLASS_OPT_KEY to classOf[ComplexKeyGenerator].getName.
Please see the description here:
https://hudi.apache.org/docs/writing_data.html#datasource-writer

Note: RECORDKEY_FIELD_OPT_KEY is the key variable mapped to the
hoodie.datasource.write.recordkey.field configuration.

Thanks,

Adam Feldman

On Thu, Jul 16, 2020 at 1:00 PM Sivaprakash wrote:
> Looks like this property does the trick
>
> Property: hoodie.datasource.write.recordkey.field, Default: uuid
> Record key field. Value to be used as the recordKey component of HoodieKey.
> Actual value will be obtained by invoking .toString() on the field value.
> Nested fields can be specified using the dot notation eg: a.b.c
>
> However, I couldn't provide more than one column like this... COL1.COL2
>
> 'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
>
> Anything wrong with the syntax? (tried with comma as well)
Re: Handling delta
Looks like this property does the trick
Property: hoodie.datasource.write.recordkey.field, Default: uuid
Record key field. Value to be used as the recordKey component of HoodieKey.
Actual value will be obtained by invoking .toString() on the field value.
Nested fields can be specified using the dot notation eg: a.b.c
However, I couldn't provide more than one column like this... COL1.COL2

'hoodie.datasource.write.recordkey.field': 'COL1.COL2'
Anything wrong with the syntax? (tried with comma as well)
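As context for the dot-vs-comma question: dot notation addresses a single nested field, while commas list multiple top-level key fields. A rough, hypothetical sketch of how a key generator might interpret the two forms (plain Python, not Hudi's actual implementation):

```python
def resolve_field(record, path):
    # Dot notation walks into nested structures: "a.b.c"
    value = record
    for part in path.split("."):
        value = value[part]
    return value

def composite_key(record, config):
    # Comma-separated config lists several top-level fields: "COL1,COL2"
    parts = []
    for field in config.split(","):
        parts.append(f"{field}:{resolve_field(record, field)}")
    return ",".join(parts)

record = {"COL1": "NR001", "COL2": "YXXXTRE", "a": {"b": {"c": 7}}}
print(composite_key(record, "COL1,COL2"))  # COL1:NR001,COL2:YXXXTRE
print(resolve_field(record, "a.b.c"))      # 7
```

So 'COL1.COL2' is read as one nested path (column COL2 inside COL1), which is why the comma form plus the ComplexKeyGenerator is needed for two separate columns.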
--
- Prakash.
Re: Handling delta
Hello Balaji
Thank you for your info !!
I tried those options, but here is what I find (I'm trying to understand how
Hudi internally manages its files):
First Write
1.
('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'TRE', 'TRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')
Commit time for all the records 20200716212533
2.
('NR001', 'YXXXTRE', 'YXXXTRE_445343')
('NR002', 'ZYYYTRE', 'ZYYYTRE_445343')
('NR003', 'YZZZTRE', 'YZZZTRE_445343')
(Only one record changed in my new dataset; the other two records are the
same as in 1. But after a snapshot/incremental read I see that the commit
time is updated for all 3 records.)
Commit time for all the records 20200716214544
- Does it mean that Hudi re-writes all 3 records again? I thought it would
write only the 2nd record
- Trying to understand the storage volume efficiency here
- Does some configuration have to be enabled to fix this?
The configuration I use:
- COPY_ON_WRITE, Append, Upsert
- First Column (NR001) is configured as
*hoodie.datasource.write.recordkey.field*
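Consistent with the observation above, a COPY_ON_WRITE upsert rewrites the whole data file containing the updated key, so every record in that rewritten file carries the new commit time even when its payload is unchanged. A toy model of that behavior (plain Python, not Hudi code):

```python
def upsert_cow(file_records, updates, commit_time):
    # Copy-on-Write: the data file is rewritten in full, so every record in
    # it is stamped with the new commit time, changed payload or not.
    merged = {key: (commit_time, payload)
              for key, (_, payload) in file_records.items()}
    for key, payload in updates.items():
        merged[key] = (commit_time, payload)
    return merged

v1 = upsert_cow({}, {"NR001": "YXXXTRE", "NR002": "TRE",
                     "NR003": "YZZZTRE"}, "20200716212533")
v2 = upsert_cow(v1, {"NR002": "ZYYYTRE"}, "20200716214544")
# All three records now show the second commit time, but only
# NR002's payload actually changed.
```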
--
- Prakash.
Re: Handling delta
Hi Sivaprakash,

Uniqueness of records is determined by the record key you specify to Hudi.
Hudi supports filtering out existing records (by record key); by default, it
would upsert all incoming records. Please look at
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoesHudihandleduplicaterecordkeysinaninput
for information on how to dedupe records based on record key.

Balaji.V

On Thursday, July 16, 2020, 04:23:22 AM PDT, Sivaprakash wrote:
> This might be a basic question - I'm experimenting with Hudi (PySpark). I
> have used Insert/Upsert options to write delta into my data lake. However,
> one thing is not clear to me.
>
> Step 1: I write 50 records.
> Step 2: I write 50 records, out of which only *10 records have been
> changed* (I'm using upsert mode; tried with MERGE_ON_READ and also
> COPY_ON_WRITE).
> Step 3: I was expecting only 10 records to be written, but it writes the
> whole 50 records. Is this normal behaviour? Does it mean I need to
> determine the delta myself and write only that?
>
> Am I missing something?
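The dedup approach in that FAQ boils down to keeping, per record key, the row with the largest precombine value before writing. A plain-Python sketch of that reduction (field names here are illustrative, and this is not Hudi's implementation):

```python
def dedupe(rows, key_field, precombine_field):
    # Keep exactly one row per record key: the one with the
    # largest precombine value (i.e. the latest version).
    best = {}
    for row in rows:
        k = row[key_field]
        if k not in best or row[precombine_field] > best[k][precombine_field]:
            best[k] = row
    return list(best.values())

rows = [
    {"id": "NR002", "ts": 1, "val": "TRE"},
    {"id": "NR002", "ts": 2, "val": "ZYYYTRE"},  # later version wins
    {"id": "NR001", "ts": 1, "val": "YXXXTRE"},
]
result = dedupe(rows, "id", "ts")
```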
Re: Handling delta
Yes - the 10 records I mentioned are changed from Step 1, but I re-write the
whole dataset the second time as well. I see that commit_time is getting
updated for all 50 records (which I feel is normal), but I'm not sure how to
see/prove to myself that the data is not growing (to 100 records; it should
remain only 50 records).

On Thu, Jul 16, 2020 at 4:01 PM Allen Underwood wrote:
> Hi Sivaprakash,
>
> So I'm by no means an expert on this, but I think you might find what
> you're looking for here:
> https://hudi.apache.org/docs/concepts.html

--
- Prakash.
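One way to convince yourself the table is not growing: because upsert is keyed on the record key, a snapshot read should keep returning one row per key. A self-contained toy model of repeated upserts (plain Python, not Hudi):

```python
def upsert(table, batch):
    # Upsert keyed on record key: existing keys are replaced, new keys
    # added, so the table never accumulates duplicate rows per key.
    merged = dict(table)
    merged.update(batch)
    return merged

t = upsert({}, {f"key{i}": f"v1_{i}" for i in range(50)})  # first write: 50 rows
changed = {f"key{i}": f"v2_{i}" for i in range(10)}        # 10 changed records
t = upsert(t, changed)                                     # re-write of all 50

assert len(t) == 50          # snapshot count stays 50, not 100
assert t["key0"] == "v2_0"   # changed rows were updated in place
```

In practice the equivalent check would be a snapshot query counting rows (or distinct record keys) after each commit.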
Re: Handling delta
Hi Sivaprakash,

Not an expert here either, but for your second question: yes, I believe that
when writing a delta to the table you must identify the actual delta
yourself and write only the new/changed/removed records. I guess we could
put in a request for Hudi to take care of this, but two possible issues
would be: Hudi would need to know which columns in the table are important
for the diff (or consider all columns), and this may add significant
overhead.

Thanks,

On Thu, Jul 16, 2020, 10:01 Allen Underwood wrote:
> Hi Sivaprakash,
>
> So I'm by no means an expert on this, but I think you might find what
> you're looking for here:
> https://hudi.apache.org/docs/concepts.html
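Identifying the delta yourself can be as simple as comparing old and new snapshots by record key and keeping only new or changed rows (deletions would need separate handling). A minimal sketch in plain Python:

```python
def compute_delta(old, new):
    # Keep only rows whose payload changed or that are brand new;
    # unchanged rows are dropped from the upsert batch.
    return {k: v for k, v in new.items() if old.get(k) != v}

old = {"NR001": "YXXXTRE", "NR002": "TRE", "NR003": "YZZZTRE"}
new = {"NR001": "YXXXTRE", "NR002": "ZYYYTRE", "NR003": "YZZZTRE"}
delta = compute_delta(old, new)
print(delta)  # {'NR002': 'ZYYYTRE'} - only the changed record is written
```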
Re: Handling delta
Hi Sivaprakash,

So I'm by no means an expert on this, but I think you might find what you're
looking for here:
https://hudi.apache.org/docs/concepts.html

I'm not sure I fully understand the Step 2 you mentioned ("I'm writing 50
records out of which only 10 records have been changed") - does that mean
that you updated 10 records from Step 1? Or are you updating some of the
other 40 records?

Either way, the key point is that all deltas will be written; it's after
those records are written to disk that they are consolidated during the
COMPACTION phase. I *BELIEVE* this is how it works. Take a look at
COMPACTION under the timeline section here:
https://hudi.apache.org/docs/concepts.html#timeline

Hope that helps a bit.

Allen

On Thu, Jul 16, 2020 at 7:23 AM Sivaprakash wrote:
> This might be a basic question - I'm experimenting with Hudi (PySpark). I
> have used Insert/Upsert options to write delta into my data lake. However,
> one thing is not clear to me.

--
*Allen Underwood*
Principal Software Engineer
Broadcom | Symantec Enterprise Division
*Mobile*: 404.808.5926
