Hi Bill,
Have you tried using Presto (from EMR) to query HUDI tables on S3? It can
support real-time queries. You also have to partition your data properly to
minimize the amount of data each query has to scan/process.
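
To illustrate the partitioning point, here is a minimal pure-Python sketch of why a filter on the partition field lets a query skip most of the data. It does not use Presto or HUDI; the `event_date` field, the sample records, and the helper names are all hypothetical.

```python
from collections import defaultdict


def partition_path(record, field="event_date"):
    """Return the partition directory name for a record (hypothetical field)."""
    return f"{field}={record[field]}"


def write_partitioned(records, field="event_date"):
    """Group records by partition path, mimicking a partitioned table layout."""
    layout = defaultdict(list)
    for record in records:
        layout[partition_path(record, field)].append(record)
    return dict(layout)


def scan(layout, field, value):
    """Read only the partition matching the filter; return (rows, rows_scanned)."""
    rows = layout.get(f"{field}={value}", [])
    return rows, len(rows)


# Hypothetical sample data: three events spread over two dates.
records = [
    {"user_id": "u1", "event_date": "2021-06-01"},
    {"user_id": "u2", "event_date": "2021-06-01"},
    {"user_id": "u1", "event_date": "2021-06-02"},
]
layout = write_partitioned(records)
rows, scanned = scan(layout, "event_date", "2021-06-02")
print(scanned, "of", len(records), "rows scanned")
```

A query that filters on a non-partition column would have to read every partition, so the partition field should match the most common filter.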
Regards,
Felix K Jose
From: Jialun Liu
Date: Saturday, June 5, 2021 at
No worries. Is the custom build something you can work with the AWS team to
get installed so that you can test it?
-Nishith
On Wed, Apr 14, 2021 at 12:57 PM Kizhakkel Jose, Felix
wrote:
> Hi Nishith, Vinoth,
>
> Thank yo
Hi Nishith, Vinoth,
Thank you so much for the quick response and offering the help.
Regards,
Felix K Jose
From: Kizhakkel Jose, Felix
Date: Wednesday, April 14, 2021 at 3:55 PM
To: dev@hudi.apache.org
Subject: Re: GDPR deletes and Consenting deletes of data from hudi table
, that the streaming job
> is writing to, then only one of them will succeed.
>
> We are working on a design for true lock free concurrency control, which
> provides the benefits of both models. But, won't be there for another month
> or two.
>
> Thanks
> Vinoth
>
>
> O
e, Apr 13, 2021 at 1:45 PM Kizhakkel Jose, Felix
wrote:
> Hi All,
>
> I have 100s of HUDI tables (AWS S3), each of which is populated via
> Spark structured streaming from Kafka streams. Now I have to delete records
> for a given user (userId) from all the tables which h
Hi All,
I have 100s of HUDI tables (AWS S3), each of which is populated via Spark
structured streaming from Kafka streams. Now I have to delete records for a
given user (userId) from all the tables which have data for that user. Meaning
all tables where we have reference to that specific
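
As a rough sketch of the per-user delete described above, the snippet below builds one delete job per table from the HUDI Spark datasource write options. The table list, column names, paths, and helper names are hypothetical, and the Spark calls are shown only as comments because they need a running cluster.

```python
def hudi_delete_options(table_name, record_key, partition_field, precombine_field):
    """Build HUDI datasource write options for a delete operation."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "delete",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


def plan_user_deletes(tables, user_id):
    """Return one (path, filter_expression, options) job per table.

    `tables` is a hypothetical catalog: a list of dicts describing each table.
    The filter is built by simple string formatting, so `user_id` is assumed
    to be a trusted value here.
    """
    jobs = []
    for table in tables:
        options = hudi_delete_options(
            table["name"],
            table["record_key"],
            table["partition_field"],
            table["precombine_field"],
        )
        filter_expression = f"{table['user_column']} = '{user_id}'"
        jobs.append((table["path"], filter_expression, options))
    return jobs


# Hypothetical catalog entry; names and the S3 path are illustrative only.
tables = [
    {
        "name": "device_events",
        "path": "s3://example-bucket/device_events",
        "record_key": "event_id",
        "partition_field": "event_date",
        "precombine_field": "ts",
        "user_column": "userId",
    },
]
jobs = plan_user_deletes(tables, "u42")

# With a SparkSession named `spark`, each job could then be run roughly as:
#   to_delete = spark.read.format("hudi").load(path).where(filter_expression)
#   to_delete.write.format("hudi").options(**options).mode("append").save(path)
```

Because the same tables are also written by streaming jobs, such a delete job would need to be coordinated with those writers.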
AM Vinoth Chandar wrote:
> Makes a lot of sense to add IMO.
>
> Satish, since you proposed this thread. what do you suggest as next steps?
> Does this deserve a RFC?
>
> On Wed, Feb 10, 2021 at 5:00 AM Kizhakkel Jose, Felix
> wrote:
>
> > Hi Vinoth,
> >
>
keep going :)
Thanks
Vinoth
On Tue, Feb 9, 2021 at 5:44 PM Kizhakkel Jose, Felix
wrote:
> Hello All,
> I would like to sort records in each file on COW table by a given key
> while ingesting/writing data - I am using Spark Data source + Kafka
> (Structured Streaming).
> HUDI is doing
Hello All,
I would like to sort records in each file on COW table by a given key while
ingesting/writing data - I am using Spark Data source + Kafka (Structured
Streaming).
HUDI does a great job of getting each file to the optimal file size (by
compaction and appending data to smaller
Regards,
Felix K Jose
From: Vinoth Chandar
Date: Tuesday, November 24, 2020 at 5:52 PM
To: Sivabalan
Cc: Kizhakkel Jose, Felix , Raymond Xu
, dev@hudi.apache.org
Subject: Re: Hudi Record Key Best Practices
Agree with Siva's suggestions.
For clustering, it's not necessary for it to be part of t
duplicate info knowing that you already partitioned it by that field.
For 3, it seems too long for a primary id.
Hope this helps.
On Mon, Nov 23, 2020 at 6:25 PM Kizhakkel Jose, Felix
wrote:
@Vinoth Chandar<mailto:vin...@apache.org>,
Could you please
@Vinoth Chandar<mailto:vin...@apache.org>,
Could you please take a look at this and let me know what the best approach is,
or point me to someone who can help me with this?
Regards,
Felix K Jose
From: Kizhakkel Jose, Felix
Date: Thursday, November 19, 2020 at 12:04 PM
To: dev@hudi.apache.org ,
18, 2020 at 7:38 AM Kizhakkel Jose, Felix
wrote:
> Hi Raymond,
> Thank you for the response.
>
> Yes, the virtual key is definitely going to help reduce the storage
> footprint. When do you think it is going to be available and will it be
> compatible with all downstream processi
Hi Balaji,
Is the bucketing implementation in HUDI adhering to Hive-style bucketing [Hive
murmur3hash (for Hive 3.x.y) and hivehash (for Hive 1.x.y and 2.x.y)]? It’s
the bucketing style that all downstream processing engines are compatible with.
Regards,
Felix K Jose
From: Balaji Varadarajan
Date:
improving on the partition field? to
have more even writes across partitions for eg?
On Sat, Nov 14, 2020 at 8:46 PM Kizhakkel Jose, Felix
wrote:
> Hello All,
>
> I have asked generic questions regarding record key in slack channel, but
> I just want to consolidate everything regardin
Hello All,
I have asked generic questions regarding record key in slack channel, but I
just want to consolidate everything regarding Record Key and the suggested best
practices of Record Key construction to get better write performance.
Table Type: COW
Partition Path: Date
My record
Hello All,
Hive has the bucketBy feature, and Spark is going to add Hive-style bucketBy
support for data sources; once it’s implemented, it is going to greatly
benefit read performance. So, as HUDI has a different path for
writing parquet data, are we planning to add