Re: [VOTE] Release 0.14.1, release candidate #1

2023-12-25 Thread Nicolas Paris
-1 (non-binding). Ran our internal test suite on 0.14.1-rc1 and found two issues in hudi third-party integrations: - datadog: https://github.com/apache/hudi/issues/10403 - dynamodb lock provider: https://github.com/apache/hudi/issues/10394 Proposed a PR for each. On Sun, 2023-12-24 at 07:01 -0800, Sivabalan

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-21 Thread Nicolas Paris
We fixed the hudi memory leak by patching parquet 1.12 and relying on gradle to overwrite the transitive parquet dependencies with that patched version. I would say an entry in the hudi FAQ on this issue would be great, since it is hard to spot and is marked as fixed on the spark side. Also we didn't

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
Following up on this, only spark 3.5.x ships with the fixed parquet version 1.13.x. It's available for the latest hudi 0.14 only. If I replace parquet in a previous version of spark, it likely breaks the readers/writers since methods have changed in parquet. Right now I will experiment with 3.5 and

Re: [External] Current state of parquet zstd OOM with hudi

2023-11-20 Thread Nicolas Paris
> version and check if it is fixed w/o changing anything in hudi > From: "nicolas paris" > Date: Mon, Nov 20, 2023, 20:07 > Subject: [External] Current state of parquet zstd OOM with hudi > To: "Hudi Dev List" > hey a month ago someone spotted a memory leak while reading

Current state of parquet zstd OOM with hudi

2023-11-20 Thread nicolas paris
hey, a month ago someone spotted a memory leak while reading zstd files with hudi https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280 since then spark has merged fixes for 3.2.4, 3.3.3, 3.4.0 https://issues.apache.org/jira/browse/SPARK-41952 we are currently on spark 3.2.4, hudi

Tuning guide question about off-heap

2023-11-20 Thread nicolas paris
hi everyone, from the tuning guide: > Off-heap memory : Hudi writes parquet files and that needs good amount of off-heap memory proportional to schema width. Consider setting something like spark.executor.memoryOverhead or spark.driver.memoryOverhead, if you are running into such failures. can
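A minimal sketch of what such overhead settings might look like when rendered as spark-submit arguments; the sizes below are hypothetical and depend on schema width and executor sizing, not values from the tuning guide:

```python
# Hypothetical sizes: memoryOverhead grants off-heap headroom for
# parquet writes (wider schemas need more headroom).
spark_confs = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    "spark.driver.memoryOverhead": "1g",
}

# Render the confs as spark-submit arguments, sorted for stable output.
submit_args = " ".join(f"--conf {k}={v}" for k, v in sorted(spark_confs.items()))
print(submit_args)
```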

Re: Improved MOR spark reader

2023-07-24 Thread Nicolas Paris
OR; see draft RFC here: https://github.com/apache/hudi/pull/9235. >Feel free to give feedback there. > >Best, >- Ethan > >On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris >wrote: > >> Just to clarify: the read path described is all about RT views here only, >> not

Re: Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
Just to clarify: the read path described is all about RT views here only, not related to RO. On July 22, 2023 8:14:09 PM UTC, Nicolas Paris wrote: >I have been playing with the starrocks MOR hudi reader recently and it does an >amazing work: it has two read paths: > >1. For partiti

Improved MOR spark reader

2023-07-22 Thread Nicolas Paris
I have been playing with the starrocks MOR hudi reader recently and it does amazing work: it has two read paths: 1. For partitions with log files, use the merging logic 2. For partitions with only parquet files, use the cow read logic As you know, the first path is slow because it has merging
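The two-path dispatch described above can be sketched as follows (a toy illustration, not the starrocks implementation; real hudi log files carry more elaborate names than a bare `.log` suffix):

```python
def choose_read_path(partition_files):
    """Pick a read path per partition: partitions holding log files need
    the slower merge logic; parquet-only partitions read like plain COW."""
    has_logs = any(f.endswith(".log") for f in partition_files)
    return "merge" if has_logs else "cow"

print(choose_read_path(["f1.parquet", "f1.log"]))  # merge path
print(choose_read_path(["f2.parquet"]))            # cow path
```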

Re: Discuss fast copy on write rfc-68

2023-07-21 Thread Nicolas Paris
UTC, Nicolas Paris wrote: >Splitting a parquet file into 5 row groups leads to the same benefit as creating 5 >parquet files of 1 row group each. > >Also the latter can involve more parallelism for writes. > >Am I missing something? > >On July 20, 2023 12:38:54 PM UTC, sa

Re: Discuss fast copy on write rfc-68

2023-07-20 Thread Nicolas Paris
ing updates should not be that tricky. > >Regards, >Sagar > >On Thu, Jul 20, 2023 at 3:26 PM nicolas paris >wrote: > >> Hi, >> >> Multiple independent initiatives for fast copy on write have emerged >> (correct me if I am wrong): >> 1. >> >> htt

Discuss fast copy on write rfc-68

2023-07-20 Thread nicolas paris
Hi, Multiple independent initiatives for fast copy on write have emerged (correct me if I am wrong): 1. https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md 2. https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/ The idea is to

Re: Record level index with not unique keys

2023-07-13 Thread nicolas paris
30298, 0, 1689147210233} ]| On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote: > Hi Nicolas, > > The RI feature is designed for max performance as it is at a record-count scale. Hence, the schema is simplified and minimized. > > With non unique keys

Record level index with not unique keys

2023-07-12 Thread nicolas paris
hi there, Just tested a preview of RLI (rfc-08), an amazing feature. Soon the fast COW (rfc-68) will be based on RLI to get the parquet offsets and allow targeting parquet row groups. RLI is a global index; therefore it assumes the hudi key is present in at most one parquet file. As a result in the

Re: [DISCUSS] Hudi Reverse Streamer

2023-06-14 Thread Nicolas Paris
Hi, any RFC/ongoing efforts on the reverse delta streamer? We have a use case to do hudi => Kafka and would enjoy building a more general tool. However we need an RFC as a basis to start the effort in the right way. On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar wrote: >Cool. lets draw up a RFC

Re: Calling for 0.13.1 Release

2023-05-04 Thread nicolas paris
Hi, any timeline for the 0.13.1 bugfix release? May that one be added to the prep branch https://github.com/apache/hudi/pull/8432 On Thu, 2023-03-09 at 11:21 -0600, Shiyan Xu wrote: > thanks for volunteering! let's collab on the release work > > On Sun, Mar 5, 2023 at 8:16 PM Forward Xu >

Re: [REVERT] [VOTE] Release 0.12.0, release candidate #1

2022-10-07 Thread Nicolas Paris
Hi dev team, I take this opportunity to also propose landing this tiny fix, which led us to avoid the spark-bundle due to conflicts with other libs. https://github.com/apache/hudi/pull/6874 In any case, thanks! On Fri, 2022-10-07 at 18:43 +0800, Shiyan Xu wrote: > Thank you, Zhaojing, for

Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Thanks to the community support, I have closed that issue, commenting the reason. Glad to see 0.11.1 soon. On Fri Jun 10, 2022 at 11:33 AM CEST, Nicolas Paris wrote: > Hi team > > I likely spotted a blocker issue with the incremental cleaning service > which is a blocker

Re: Updates on 0.11.1 release

2022-06-10 Thread Nicolas Paris
Hi team, I likely spotted an issue with the incremental cleaning service which is a blocker on our side for scaling cleaning on large tables. See https://github.com/apache/hudi/issues/5835 Please tell me if my email does not respect the release process. On Wed Jun 8, 2022 at 1:39 AM CEST, Y

Re: spark 3.2.1 built-in bloom filters

2022-05-19 Thread Nicolas Paris
> engines for point-ish lookups. > > Hope that helps > > Thanks > Vinoth > > > > > On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris > wrote: > > > Hi, > > > > spark 3.2 ships parquet 1.12 which provides built-in bloom filters on > > arbirtr

Re: spark 3.2.1 built-in bloom filters

2022-04-02 Thread Nicolas Paris
sm is stable, we plan to stop writing out > bloom > filters in parquet and also integrate the Hudi MDT with different > query > engines for point-ish lookups. > > Hope that helps > > Thanks > Vinoth > > > > > On Mon, Mar 28, 2022 at 9:57 AM Nicolas

spark 3.2.1 built-in bloom filters

2022-03-28 Thread Nicolas Paris
Hi, spark 3.2 ships parquet 1.12 which provides built-in bloom filters on arbitrary columns. I wonder: - can hudi benefit from them? (likely in 0.11, but not with MOR tables) - would it make sense to replace the hudi blooms with them? - what would be the advantage of storing our blooms in
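For reference, parquet 1.12's bloom filters are enabled per column via hadoop configuration keys of the form `parquet.bloom.filter.enabled#<column>`; a sketch of how the writer options could be assembled from pyspark (the `user_id` column name and NDV value are hypothetical, and the actual write requires a SparkSession, which is not created here):

```python
bloom_col = "user_id"  # hypothetical column to index
writer_options = {
    # parquet-mr per-column bloom filter switches (parquet >= 1.12)
    f"parquet.bloom.filter.enabled#{bloom_col}": "true",
    f"parquet.bloom.filter.expected.ndv#{bloom_col}": "1000000",
}

# With a SparkSession at hand, these would be applied roughly as:
#   df.write.options(**writer_options).parquet("/tmp/table")
print(writer_options)
```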

Re: [ANNOUNCE] Apache Hudi 0.10.1 released

2022-01-29 Thread Nicolas Paris
congrats! What about also posting releases to the apache announce mailing list annou...@apache.org On Fri Jan 28, 2022 at 1:39 PM CET, Sivabalan wrote: > The Apache Hudi team is pleased to announce the release of Apache > > Hudi 0.10.1. > > > Apache Hudi (pronounced Hoodie) stands for Hadoop

Re: Limitations of non unique keys

2021-11-03 Thread Nicolas Paris
any column. not just the key. > In another words, we are generalizing this so hudi feels more like MySQL > and not HBase/Cassandra (key value store). Thats the direction we are > approaching. > > love to hear more feedback. > > On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris > wro

Re: Limitations of non unique keys

2021-11-02 Thread Nicolas Paris
for example, does the move of blooms into hfiles (0.10.0 feature) make unique bloom keys mandatory? On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote: > > > Are you asking if there are advantages to allowing duplicates or not having > > keys in your table? > it's

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-28 Thread Nicolas Paris
o the hudi: df_hudi_keys.options(**hudi_options).save(...) Then a full featured / documented hoodie client is maybe the best option though? On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote: > Sounds great! > > On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris > wrote: >

Re: feature request/proposal: leverage bloom indexes for reading

2021-10-26 Thread Nicolas Paris
rowse/HUDI-1295 > > Please let us know if you are interested in testing that when the PR is > up. > > Thanks > Vinoth > > On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris > wrote: > > > hi ! > > > > In my use case, for GDPR I have to export all informations of a

Limitations of non unique keys

2021-10-26 Thread Nicolas Paris
Hi devs, AFAIK hudi has been designed around unique primary keys as the hudi key. However it is possible to also choose a non-unique field. I have listed several troubles with such a design. A non-unique key leads to: - cannot delete / update a unique record - cannot apply primary key for new sql

feature request/proposal: leverage bloom indexes for reading

2021-10-19 Thread Nicolas Paris
hi! In my use case, for GDPR I have to export all information about a given user from several HUGE hudi tables. Filtering the table results in a full scan of around 10 hours, and this will get worse year after year. Since the filter criterion is based on the bloom key (user_id), it would be handy to
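The file pruning asked for here can be illustrated with a toy bloom filter (a pure-python sketch under simplified assumptions; hudi's actual bloom index lives in parquet footers / the metadata table and uses different hashing and sizing):

```python
import hashlib

class Bloom:
    """Toy bloom filter: k hash positions per key over a fixed bit array."""
    def __init__(self, size_bits=1 << 14, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive k positions from salted sha256 digests of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # No false negatives; rare false positives.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# One bloom per data file (hypothetical file names and user ids);
# a point lookup only opens files whose bloom might contain the key.
files = {"file_a": ["u1", "u2"], "file_b": ["u3"], "file_c": ["u4"]}
blooms = {}
for name, keys in files.items():
    b = Bloom()
    for key in keys:
        b.add(key)
    blooms[name] = b

def candidate_files(user_id):
    return [name for name, b in blooms.items() if b.might_contain(user_id)]

print(candidate_files("u3"))
```

This is the core of why a bloom-key filter can beat a 10-hour full scan: files whose filters rule the key out are never read at all.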