Re: [VOTE] Release 0.14.1, release candidate #1

2023-12-25 Thread Nicolas Paris
-1 (non-binding). Ran our internal test suite on 0.14.1-rc1 and found two issues in hudi third-party integrations: - datadog: https://github.com/apache/hudi/issues/10403 - dynamodb lock provider: https://github.com/apache/hudi/issues/10394 Proposed a PR for each. On Sun, 2023-12-24 at 07:01 -0800, Sivabalan

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-21 Thread Nicolas Paris
We fixed the hudi memory leak by patching parquet 1.12 and relying on gradle to overwrite the transitive parquet dependencies with that patched version. I would say an entry in the hudi FAQ on this issue would be great, since it is hard to spot and is marked as fixed on the spark side. Also we didn't

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
Following up on this, only spark 3.5.x ships with the fixed parquet version 1.13.x. It's available for the latest hudi 0.14 only. If I replace parquet in a previous version of spark, it likely breaks the readers/writers since methods have changed in parquet. Right now I will experiment with 3.5 and

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
> version and check if it is fixed w/o changing anything in hudi > From: "nicolas paris" > Date: Mon, Nov 20, 2023, 20:07 > Subject: [External] Current state of parquet zstd OOM with hudi > To: "Hudi Dev List" > hey a month ago someone spotted a memory leak while reading

Current state of parquet zstd OOM with hudi

2023-11-20 Thread nicolas paris
hey, a month ago someone spotted a memory leak while reading zstd files with hudi https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280 since then spark has merged fixes for 3.2.4, 3.3.3, 3.4.0 https://issues.apache.org/jira/browse/SPARK-41952 we are currently on spark 3.2.4, hudi

Tuning guide question about off-heap

2023-11-20 Thread nicolas paris
hi everyone, from the tuning guide: > Off-heap memory : Hudi writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like spark.executor.memoryOverhead or spark.driver.memoryOverhead, if you are running into such failures. can
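A minimal sketch of what such overhead settings might look like when rendered as spark-submit arguments; the sizes below are hypothetical and depend on schema width and executor sizing, not values from the tuning guide:

```python
# Hypothetical sizes: memoryOverhead grants off-heap headroom for
# parquet writes (wider schemas need more headroom).
spark_confs = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    "spark.driver.memoryOverhead": "1g",
}

# Render the confs as spark-submit arguments, sorted for stable output.
submit_args = " ".join(f"--conf {k}={v}" for k, v in sorted(spark_confs.items()))
print(submit_args)
```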

Re: Improved MOR spark reader

2023-07-24 Thread Nicolas Paris
OR; see draft RFC here: https://github.com/apache/hudi/pull/9235. >Feel free to give feedback there. > >Best, >- Ethan > >On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris >wrote: > >> Just to clarify: the read path described is all about RT views here only, >> not

Re: Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
Just to clarify: the read path described is all about RT views here only, not related to RO. On July 22, 2023 8:14:09 PM UTC, Nicolas Paris wrote: >I have been playing with the starrocks MOR hudi reader recently and it does an >amazing work: it has two read paths: > >1. For partiti

Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
I have been playing with the starrocks MOR hudi reader recently and it does amazing work: it has two read paths: 1. For partitions with log files, use the merging logic 2. For partitions with only parquet files, use the cow read logic As you know, the first path is slow because it has merging
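The two-path dispatch described above can be sketched as follows (a toy illustration, not the starrocks implementation; real hudi log files carry more elaborate names than a bare `.log` suffix):

```python
def choose_read_path(partition_files):
    """Pick a read path per partition: partitions holding log files need
    the slower merge logic; parquet-only partitions read like plain COW."""
    has_logs = any(f.endswith(".log") for f in partition_files)
    return "merge" if has_logs else "cow"

print(choose_read_path(["f1.parquet", "f1.log"]))  # merge path
print(choose_read_path(["f2.parquet"]))            # cow path
```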

Re: Discuss fast copy on write rfc-68

2023-07-21 Thread Nicolas Paris
UTC, Nicolas Paris wrote: >Splitting a parquet file into 5 row groups leads to the same benefit as creating 5 >parquet files of 1 row group each. > >Also the latter can involve more parallelism for writes. > >Am I missing something? > >On July 20, 2023 12:38:54 PM UTC, sa

Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
ing updates should not be that tricky. > >Regards, >Sagar > >On Thu, Jul 20, 2023 at 3:26 PM nicolas paris >wrote: > >> Hi, >> >> Multiple independent initiatives for fast copy on write have emerged >> (correct me if I am wrong): >> 1. >> >> htt

Discuss fast copy on write rfc-68

2023-07-20 Thread nicolas paris
Hi, Multiple independent initiatives for fast copy on write have emerged (correct me if I am wrong): 1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md 2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/ The idea is to

Re: Record level index with not unique keys

2023-07-13 Thread nicolas paris
30298, 0, 1689147210233} ]| On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote: > Hi Nicolas, > > The RI feature is designed for max performance as it is at a record-count scale. Hence, the schema is simplified and minimized. > > With non unique keys

Record level index with not unique keys

2023-07-12 Thread nicolas paris
hi there, Just tested a preview of RLI (rfc-08), an amazing feature. Soon the fast COW (rfc-68) will be based on RLI to get the parquet offsets and allow targeting parquet row groups. RLI is a global index; therefore it assumes the hudi key is present in at most one parquet file. As a result in the

Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Nicolas Paris
Hi, any RFC/ongoing efforts on the reverse delta streamer? We have a use case to do hudi => Kafka and would enjoy building a more general tool. However we need an RFC as a basis to start the effort in the right way. On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar wrote: >Cool. lets draw up a RFC

Re: Calling for 0.13.1 Release

2023-05-04 Thread nicolas paris
Hi, any timeline for the 0.13.1 bugfix release? May that one be added to the prep branch https://github.com/apache/hudi/pull/8432 On Thu, 2023-03-09 at 11:21 -0600, Shiyan Xu wrote: > thanks for volunteering! let's collab on the release work > > On Sun, Mar 5, 2023 at 8:16 PM Forward Xu >

Re: [REVERT] [VOTE] Release 0.12.0, release candidate #1

2022-10-07 Thread Nicolas Paris
Hi dev team, I take this opportunity to also propose landing this tiny fix, which led us to avoid the spark-bundle due to conflicts with other libs. https://github.com/apache/hudi/pull/6874 In any case, thanks! On Fri, 2022-10-07 at 18:43 +0800, Shiyan Xu wrote: > Thank you, Zhaojing, for

Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Thanks to the community support, I have closed that issue, commenting the reason. Glad to see 0.11.1 soon. On Fri Jun 10, 2022 at 11:33 AM CEST, Nicolas Paris wrote: > Hi team > > I likely spotted a blocker issue with the incremental cleaning service > which is a blocker

Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Hi team, I likely spotted an issue with the incremental cleaning service which is a blocker on our side for scaling cleaning on large tables. See https://github.com/apache/hudi/issues/5835 Please tell me if my email does not respect the release process. On Wed Jun 8, 2022 at 1:39 AM CEST, Y

Re: spark 3.2.1 built-in bloom filters

2022-05-19 Thread Nicolas Paris
> engines for point-ish lookups. > > Hope that helps > > Thanks > Vinoth > > > > > On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris > wrote: > > > Hi, > > > > spark 3.2 ships parquet 1.12 which provides built-in bloom filters on > > arbirtr

Re: spark 3.2.1 built-in bloom filters

2022-04-02 Thread Nicolas Paris
sm is stable, we plan to stop writing out > bloom > filters in parquet and also integrate the Hudi MDT with different > query > engines for point-ish lookups. > > Hope that helps > > Thanks > Vinoth > > > > > On Mon, Mar 28, 2022 at 9:57 AM Nicolas

spark 3.2.1 built-in bloom filters

2022-03-28 Thread Nicolas Paris
Hi, spark 3.2 ships parquet 1.12 which provides built-in bloom filters on arbitrary columns. I wonder: - can hudi benefit from them? (likely in 0.11, but not with MOR tables) - would it make sense to replace the hudi blooms with them? - what would be the advantage of storing our blooms in
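For reference, parquet 1.12's bloom filters are enabled per column via hadoop configuration keys of the form `parquet.bloom.filter.enabled#<column>`; a sketch of how the writer options could be assembled from pyspark (the `user_id` column name and NDV value are hypothetical, and the actual write requires a SparkSession, which is not created here):

```python
bloom_col = "user_id"  # hypothetical column to index
writer_options = {
    # parquet-mr per-column bloom filter switches (parquet >= 1.12)
    f"parquet.bloom.filter.enabled#{bloom_col}": "true",
    f"parquet.bloom.filter.expected.ndv#{bloom_col}": "1000000",
}

# With a SparkSession at hand, these would be applied roughly as:
#   df.write.options(**writer_options).parquet("/tmp/table")
print(writer_options)
```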

Re: [ANNOUNCE] Apache Hudi 0.10.1 released

2022-01-29 Thread Nicolas Paris
congrats! What about also posting releases to the apache announce mailing list annou...@apache.org On Fri Jan 28, 2022 at 1:39 PM CET, Sivabalan wrote: > The Apache Hudi team is pleased to announce the release of Apache > > Hudi 0.10.1. > > > Apache Hudi (pronounced Hoodie) stands for Hadoop

Re: Limitations of non unique keys

2021-11-03 Thread Nicolas Paris
any column. not just the key. > In another words, we are generalizing this so hudi feels more like MySQL > and not HBase/Cassandra (key value store). Thats the direction we are > approaching. > > love to hear more feedback. > > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris > wro

Re: Limitations of non unique keys

2021-11-02 Thread Nicolas Paris
for example, does the move of blooms into hfiles (0.10.0 feature) make unique bloom keys mandatory? On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote: > > > Are you asking if there are advantages to allowing duplicates or not having > > keys in your table? > it's

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Nicolas Paris
o the hudi: df_hudi_keys.options(**hudi_options).save(...) Then a full featured / documented hoodie client is maybe the best option though? On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote: > Sounds great! > > On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris > wrote: >

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-26 Thread Nicolas Paris
rowse/HUDI-1295 > > Please let us know if you are interested in testing that when the PR is > up. > > Thanks > Vinoth > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris > wrote: > > > hi ! > > > > In my use case, for GDPR I have to export all informations of a

Limitations of non unique keys

2021-10-26 Thread Nicolas Paris
Hi devs, AFAIK hudi has been designed around unique primary keys as the hudi key. However it is possible to also choose a non-unique field. I have listed several troubles with such a design. A non-unique key leads to: - cannot delete / update a unique record - cannot apply primary key for new sql

feature request/proposal: leverage bloom indexes for reading

2021-10-19 Thread Nicolas Paris
hi! In my use case, for GDPR I have to export all information about a given user from several HUGE hudi tables. Filtering the table results in a full scan of around 10 hours, and this will get worse year after year. Since the filter criterion is based on the bloom key (user_id), it would be handy to
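The file pruning asked for here can be illustrated with a toy bloom filter (a pure-python sketch under simplified assumptions; hudi's actual bloom index lives in parquet footers / the metadata table and uses different hashing and sizing):

```python
import hashlib

class Bloom:
    """Toy bloom filter: k hash positions per key over a fixed bit array."""
    def __init__(self, size_bits=1 << 14, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive k positions from salted sha256 digests of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # No false negatives; rare false positives.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# One bloom per data file (hypothetical file names and user ids);
# a point lookup only opens files whose bloom might contain the key.
files = {"file_a": ["u1", "u2"], "file_b": ["u3"], "file_c": ["u4"]}
blooms = {}
for name, keys in files.items():
    b = Bloom()
    for key in keys:
        b.add(key)
    blooms[name] = b

def candidate_files(user_id):
    return [name for name, b in blooms.items() if b.might_contain(user_id)]

print(candidate_files("u3"))
```

This is the core of why a bloom-key filter can beat a 10-hour full scan: files whose filters rule the key out are never read at all.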