Re: Could Hudi Data lake support low latency, high throughput random reads?

2021-06-05 Thread Kizhakkel Jose, Felix
Hi Bill, Did you try using Presto (from EMR) to query the HUDI tables on S3? It can support real-time queries. And you have to partition your data properly to minimize the amount of data each query has to scan/process. Regards, Felix K Jose From: Jialun Liu Date: Saturday, June 5, 2021 at
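As a rough sketch of the partitioning advice above, assuming Spark (Scala), a DataFrame named events, and placeholder column names (eventId, date, ts), the table could be written with a date partition path so that Presto only scans the matching partitions:

    import org.apache.spark.sql.SaveMode

    // Partition the Hudi table by date so query engines such as Presto can
    // prune partitions instead of scanning the whole S3 prefix.
    events.write.format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "eventId")
      .option("hoodie.datasource.write.partitionpath.field", "date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.hive_style_partitioning", "true")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/hudi/events")

A Presto query with a filter on the date partition column would then read only the partitions it needs.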

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-15 Thread Kizhakkel Jose, Felix
No worries. Is the custom build something you can work with the AWS team to get installed to be able to test? -Nishith On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix wrote: > Hi Nishith, Vinoth, > > Thank yo

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Kizhakkel Jose, Felix
Hi Nishith, Vinoth, Thank you so much for the quick response and for offering to help. Regards, Felix K Jose From: Kizhakkel Jose, Felix Date: Wednesday, April 14, 2021 at 3:55 PM To: dev@hudi.apache.org Subject: Re: GDPR deletes and Consenting deletes of data from hudi table

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Kizhakkel Jose, Felix
, that the streaming job > is writing to, then only one of them will succeed. > > We are working on a design for true lock-free concurrency control, which > provides the benefits of both models. But it won't be there for another month > or two. > > Thanks > Vinoth > > > O
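For context on the multi-writer scenario discussed in this thread, a minimal sketch of the optimistic concurrency control options (assuming Hudi 0.8.0 or later and a placeholder ZooKeeper endpoint) would look roughly like:

    // Optimistic concurrency control so a streaming ingestion job and a
    // separate delete job can write to the same table; the ZooKeeper details
    // below are placeholders.
    val occOptions = Map(
      "hoodie.write.concurrency.mode" -> "optimistic_concurrency_control",
      "hoodie.cleaner.policy.failed.writes" -> "LAZY",
      "hoodie.write.lock.provider" ->
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
      "hoodie.write.lock.zookeeper.url" -> "zk-host",
      "hoodie.write.lock.zookeeper.port" -> "2181",
      "hoodie.write.lock.zookeeper.lock_key" -> "my_table",
      "hoodie.write.lock.zookeeper.base_path" -> "/hudi/locks"
    )

These options would be passed along with the normal write options of each job.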

Re: GDPR deletes and Consenting deletes of data from hudi table

2021-04-14 Thread Kizhakkel Jose, Felix
e, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix wrote: > Hi All, > > I have 100s of HUDI tables (AWS S3), each of which is populated via > Spark structured streaming from Kafka streams. Now I have to delete records > for a given user (userId) from all the tables which h

GDPR deletes and Consenting deletes of data from hudi table

2021-04-13 Thread Kizhakkel Jose, Felix
Hi All, I have 100s of HUDI tables (AWS S3), each of which is populated via Spark structured streaming from Kafka streams. Now I have to delete records for a given user (userId) from all the tables which have data for that user, meaning all tables where we have a reference to that specific
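As a hedged sketch of one way to do such a delete (not necessarily the approach settled on in this thread), assuming Spark (Scala) and placeholder table path, key, and column names, the Hudi datasource "delete" operation can remove all records for a given user:

    import org.apache.spark.sql.SaveMode

    // Read the records belonging to the user, then issue a delete write
    // carrying the same record keys and partition paths.
    val toDelete = spark.read.format("hudi")
      .load("s3://my-bucket/hudi/events")
      .where("userId = 'user-123'")

    toDelete.write.format("hudi")
      .option("hoodie.table.name", "events")
      .option("hoodie.datasource.write.recordkey.field", "eventId")
      .option("hoodie.datasource.write.partitionpath.field", "date")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.operation", "delete")
      .mode(SaveMode.Append)
      .save("s3://my-bucket/hudi/events")

The same job would be repeated, or parameterized, for each table that references the user.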

Re: [DISCUSS] Improve data locality during ingestion

2021-02-17 Thread Kizhakkel Jose, Felix
AM Vinoth Chandar wrote: > Makes a lot of sense to add, IMO. > > Satish, since you proposed this thread, what do you suggest as next steps? > Does this deserve an RFC? > > On Wed, Feb 10, 2021 at 5:00 AM Kizhakkel Jose, Felix > wrote: > > > Hi Vinoth, > > >

Re: [DISCUSS] Improve data locality during ingestion

2021-02-10 Thread Kizhakkel Jose, Felix
keep going :) Thanks Vinoth On Tue, Feb 9, 2021 at 5:44 PM Kizhakkel Jose, Felix wrote: > Hello All, > I would like to sort records within each file of a COW table by a given key > while ingesting/writing data - I am using the Spark data source + Kafka > (Structured Streaming). > HUDI is doing

Re: [DISCUSS] Improve data locality during ingestion

2021-02-09 Thread Kizhakkel Jose, Felix
Hello All, I would like to sort records within each file of a COW table by a given key while ingesting/writing data - I am using the Spark data source + Kafka (Structured Streaming). HUDI is doing a great job of getting each file to the optimal file size (by compaction and appending data to smaller
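One hedged sketch of getting a sorted layout within files, assuming Hudi 0.7+ where clustering is available and a placeholder sort column, is to enable inline clustering with sort columns alongside the normal write options:

    // Inline clustering periodically rewrites small files and sorts records
    // within the rewritten files by the chosen column(s); "deviceId" is a
    // placeholder.
    val clusteringOptions = Map(
      "hoodie.clustering.inline" -> "true",
      "hoodie.clustering.inline.max.commits" -> "4",
      "hoodie.clustering.plan.strategy.sort.columns" -> "deviceId",
      "hoodie.clustering.plan.strategy.small.file.limit" -> (100 * 1024 * 1024).toString,
      "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (1024 * 1024 * 1024).toString
    )

This sorts data after ingestion rather than on the write path itself, which is part of what this thread is discussing.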

Re: Hudi Record Key Best Practices

2020-11-25 Thread Kizhakkel Jose, Felix
egards, Felix K Jose From: Vinoth Chandar Date: Tuesday, November 24, 2020 at 5:52 PM To: Sivabalan Cc: Kizhakkel Jose, Felix , Raymond Xu , dev@hudi.apache.org Subject: Re: Hudi Record Key Best Practices Agree with Siva's suggestions. For clustering, it's not necessary for it to be part of t

Re: Hudi Record Key Best Practices

2020-11-24 Thread Kizhakkel Jose, Felix
duplicate info knowing that you already partitioned it by that field. For 3, it seems too long for a primary id. Hope this helps. On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix wrote: @Vinoth Chandar, Could you please

Re: Hudi Record Key Best Practices

2020-11-23 Thread Kizhakkel Jose, Felix
@Vinoth Chandar, Could you please take a look and let me know the best approach, or suggest who can help me on this? Regards, Felix K Jose From: Kizhakkel Jose, Felix Date: Thursday, November 19, 2020 at 12:04 PM To: dev@hudi.apache.org ,

Re: Hudi Record Key Best Practices

2020-11-19 Thread Kizhakkel Jose, Felix
18, 2020 at 7:38 AM Kizhakkel Jose, Felix wrote: > Hi Raymond, > Thank you for the response. > > Yes, the virtual key is definitely going to help reduce the storage > footprint. When do you think it is going to be available, and will it be > compatible with all downstream processi

Re: [EXT] Re: Bucketing in Hudi

2020-11-18 Thread Kizhakkel Jose, Felix
Hi Balaji, Is the bucketing implementation in HUDI adhering to Hive-style bucketing [Hive murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y)]? That is the bucketing style all downstream processing engines are compatible with. Regards, Felix K Jose From: Balaji Varadarajan Date:

Re: Hudi Record Key Best Practices

2020-11-18 Thread Kizhakkel Jose, Felix
improving on the partition field? To have more even writes across partitions, for example? On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix wrote: > Hello All, > > I have asked generic questions regarding record keys in the Slack channel, but > I just want to consolidate everything regardin

Hudi Record Key Best Practices

2020-11-14 Thread Kizhakkel Jose, Felix
Hello All, I have asked generic questions regarding record keys in the Slack channel, but I just want to consolidate everything regarding the Record Key and the suggested best practices for Record Key construction to get better write performance. Table Type: COW Partition Path: Date My record
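As a rough illustration of what a record key setup for a COW table with a Date partition path might look like (field names are placeholders, and the composite key uses the ComplexKeyGenerator), in Spark (Scala):

    // A composite record key plus a date partition path; the precombine field
    // breaks ties between records that share the same key.
    val keyOptions = Map(
      "hoodie.datasource.write.table.type" -> "COPY_ON_WRITE",
      "hoodie.datasource.write.recordkey.field" -> "deviceId,eventId",
      "hoodie.datasource.write.partitionpath.field" -> "date",
      "hoodie.datasource.write.precombine.field" -> "ts",
      "hoodie.datasource.write.keygenerator.class" ->
        "org.apache.hudi.keygen.ComplexKeyGenerator"
    )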

Hudi Writer vs Spark Parquet Writer - Sync

2020-08-30 Thread Kizhakkel Jose, Felix
Hello All, Hive has the bucketBy feature, and Spark is going to add HIVE-style bucketBy support for data sources; once it's implemented, it is going to largely benefit read performance. So, since HUDI has a different path for writing parquet data, are we planning to add
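For comparison, Spark's own Hive-style bucketing (which works with saveAsTable on managed tables rather than a path-based save) looks like the following sketch, with placeholder table and column names; this is the behavior the question asks whether the Hudi write path could match:

    // Spark DataFrameWriter bucketing: 16 buckets hashed on userId, records
    // sorted by userId within each bucket file.
    df.write
      .bucketBy(16, "userId")
      .sortBy("userId")
      .format("parquet")
      .saveAsTable("events_bucketed")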