-1 (non-binding)
ran our internal test suite on 0.14.1-rc1 and found 2 issues in hudi
third-party integrations:
- datadog: https://github.com/apache/hudi/issues/10403
- dynamodb lock provider: https://github.com/apache/hudi/issues/10394
Proposed a PR for each.
On Sun, 2023-12-24 at 07:01 -0800, Sivabalan
We fixed the hudi memory leak by patching parquet 1.12 and relying on
gradle to overwrite the transitive dependencies of parquet with that
latest version.
I would say an entry in the hudi FAQ on this issue would be great, since
it is hard to spot and only marked as fixed on the spark side.
Also we didn't
Following up on this, only spark 3.5.x ships with the fixed parquet
version 1.13.x. It's available for the latest hudi 0.14 only.
If I replace parquet in a previous version of spark, it likely breaks
the readers/writers since methods have been changed in parquet.
Right now I will experiment with 3.5 and that parquet version and check
if it is fixed w/o changing anything in hudi.
> From: "nicolas paris"
> Date: Mon, Nov 20, 2023, 20:07
> Subject: [External] Current state of parquet zstd OOM with hudi
> To: "Hudi Dev List"
hey, a month ago someone spotted a memory leak while reading zstd files
with hudi:
https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280
since then spark has merged fixes for 3.2.4, 3.3.3, 3.4.0
https://issues.apache.org/jira/browse/SPARK-41952
we are currently on spark 3.2.4, hudi
hi everyone,
from the tuning guide:
> Off-heap memory : Hudi writes parquet files and that needs good
amount of off-heap memory proportional to schema width. Consider
setting something like spark.executor.memoryOverhead or
spark.driver.memoryOverhead, if you are running into such failures.
can
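To make the tuning-guide advice above concrete, here is a minimal sketch of the relevant conf; the sizes are illustrative assumptions, not recommendations, and should be tuned to your schema width:

```python
# Illustrative conf giving parquet writes some off-heap headroom.
# The values are example assumptions, not recommendations.
spark_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "3g",  # off-heap headroom for parquet buffers
    "spark.driver.memory": "4g",
    "spark.driver.memoryOverhead": "1g",
}

# Rendered as spark-submit flags:
flags = " ".join(f"--conf {k}={v}" for k, v in sorted(spark_conf.items()))
```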
Or see the draft RFC here: https://github.com/apache/hudi/pull/9235.
>Feel free to give feedback there.
>
>Best,
>- Ethan
>
>On Sat, Jul 22, 2023 at 1:23 PM Nicolas Paris
>wrote:
>
Just to clarify: the read path described is all about RT views here only, not
related to RO.
On July 22, 2023 8:14:09 PM UTC, Nicolas Paris wrote:
I have been playing with the starrocks MOR hudi reader recently and it does an
amazing work: it has two read paths:
1. For partitions with log files, use the merging logic
2. For partitions with only parquet files, use the cow read logic
As you know, the first path is slow because it has merging
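A hedged sketch of that dispatch (the function name and the `.log` suffix check are my own stand-ins, not starrocks code):

```python
def pick_read_path(partition_files):
    """Choose the starrocks-style read path for one partition:
    'merge' when log files are present (MOR merging logic),
    'plain' when the partition only holds parquet (COW read logic)."""
    has_logs = any(".log" in name for name in partition_files)
    return "merge" if has_logs else "plain"

# e.g. pick_read_path(["f1.parquet", ".f1.log.1"]) -> "merge"
```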
UTC, Nicolas Paris wrote:
>Splitting a parquet file into 5 row groups leads to the same benefit as
>creating 5 parquet files of 1 row group each.
>
>Also the latter can involve more parallelism for writes.
>
>Am I missing something?
>
>On July 20, 2023 12:38:54 PM UTC, sa
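A toy illustration of the equivalence claimed above (pure python, no parquet involved): one file with 5 row groups and 5 single-row-group files expose the same 5 independently scannable chunks.

```python
def chunk_rows(n_rows, n_groups):
    """Split n_rows into n_groups contiguous (start, end) row-group ranges."""
    base, rem = divmod(n_rows, n_groups)
    ranges, start = [], 0
    for i in range(n_groups):
        size = base + (1 if i < rem else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

one_file_groups = chunk_rows(1000, 5)                 # 5 row groups in one file
five_files = [chunk_rows(200, 1) for _ in range(5)]   # 5 files, 1 group each
# Both layouts yield 5 chunks of 200 rows a reader can scan in parallel;
# the second layout also lets 5 writers produce them in parallel.
```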
ing updates should not be that tricky.
>
>Regards,
>Sagar
>
>On Thu, Jul 20, 2023 at 3:26 PM nicolas paris
>wrote:
>
Hi,
Multiple independent initiatives for fast copy-on-write have emerged
(correct me if I am wrong):
1.
https://github.com/apache/hudi/blob/f1afb1bf04abdc94a26d61dc302f36ec2bbeb15b/rfc/rfc-68/rfc-68.md
2.
https://www.uber.com/en-FR/blog/fast-copy-on-write-within-apache-parquet/
The idea is to
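As I understand those two links, the core trick can be modeled with this toy sketch (dicts stand in for parquet row groups; all names are mine): untouched row groups are carried over as-is, and only groups holding an updated key get decoded and rewritten.

```python
def fast_cow_rewrite(row_groups, updates):
    """row_groups: list of {key: value} dicts standing in for row groups.
    updates: {key: new_value}. Only groups containing an updated key are
    rewritten; the rest are reused as-is (no decode/encode cost)."""
    out, rewritten = [], 0
    for group in row_groups:
        hit = updates.keys() & group.keys()
        if hit:
            new_group = dict(group)
            for k in hit:
                new_group[k] = updates[k]
            out.append(new_group)
            rewritten += 1
        else:
            out.append(group)  # copied verbatim, like an untouched row group
    return out, rewritten
```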
30298, 0,
1689147210233}
]|
On Thu, 2023-07-13 at 10:17 -0700, Prashant Wason wrote:
> Hi Nicolas,
>
> The RI feature is designed for max performance as it is at a
> record-count scale. Hence, the schema is simplified and minimized.
>
> With non unique keys
hi there,
Just tested the preview of RLI (rfc-08), an amazing feature. Soon the fast COW
(rfc-68) will be based on RLI to get the parquet offsets and allow
targeting parquet row groups.
RLI is a global index, therefore it assumes the hudi key is present in
at most one parquet file. As a result in the
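My mental model of that uniqueness assumption, as a toy sketch (not hudi code): a global map from record key to a single file (and row-group offset), where mapping one key to a second file is rejected.

```python
class RecordLevelIndexSketch:
    """Toy global record-level index: key -> (file_id, row_group)."""

    def __init__(self):
        self._index = {}

    def put(self, key, file_id, row_group):
        existing = self._index.get(key)
        if existing is not None and existing[0] != file_id:
            # global index: a key may live in at most one parquet file
            raise ValueError(f"key {key!r} already mapped to {existing[0]}")
        self._index[key] = (file_id, row_group)

    def locate(self, key):
        """Return (file_id, row_group) or None, enabling row-group targeting."""
        return self._index.get(key)
```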
Hi, any RFC / ongoing efforts on a reverse delta streamer? We have a use
case to do hudi => Kafka and would enjoy building a more general tool.
However we need an RFC basis to start the effort in the right way.
On April 12, 2023 3:08:22 AM UTC, Vinoth Chandar
wrote:
>Cool. lets draw up a RFC
Hi, any timeline for the 0.13.1 bugfix release?
May that one be added to the prep branch:
https://github.com/apache/hudi/pull/8432
On Thu, 2023-03-09 at 11:21 -0600, Shiyan Xu wrote:
> thanks for volunteering! let's collab on the release work
>
> On Sun, Mar 5, 2023 at 8:16 PM Forward Xu
>
Hi dev team,
I take this opportunity to also propose landing this tiny fix, which
led us to avoid the spark-bundle due to conflicts with other libs.
https://github.com/apache/hudi/pull/6874
In any case, thanks !
On Fri, 2022-10-07 at 18:43 +0800, Shiyan Xu wrote:
> Thank you, Zhaojing, for
Thanks to the community support, I have closed that issue and commented
the reason.
glad to see 0.11.1 soon
On Fri Jun 10, 2022 at 11:33 AM CEST, Nicolas Paris wrote:
Hi team
I likely spotted an issue with the incremental cleaning service which is
a blocker on our side to scale cleaning on large tables.
See https://github.com/apache/hudi/issues/5835
Please tell me if my email does not respect the release process
On Wed Jun 8, 2022 at 1:39 AM CEST, Y
sm is stable, we plan to stop writing out bloom filters in parquet
> and also integrate the Hudi MDT with different query engines for
> point-ish lookups.
>
> Hope that helps
>
> Thanks
> Vinoth
>
> On Mon, Mar 28, 2022 at 9:57 AM Nicolas Paris wrote:
Hi,
spark 3.2 ships parquet 1.12, which provides built-in bloom filters on
arbitrary columns. I wonder if:
- hudi can benefit from them? (likely in 0.11, but not with MOR tables)
- it would make sense to replace the hudi blooms with them?
- what would be the advantage of storing our blooms in
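For reference, the per-column writer options parquet-mr uses for its built-in blooms look like the following (`user_id` and the ndv value are example assumptions; the keys are the parquet-hadoop config names as I know them):

```python
column = "user_id"  # example column, not from any real table
bloom_options = {
    f"parquet.bloom.filter.enabled#{column}": "true",
    f"parquet.bloom.filter.expected.ndv#{column}": "1000000",
}
# With Spark these can be passed per write, e.g.
#   df.write.option("parquet.bloom.filter.enabled#user_id", "true")...
```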
congrats
What about also posting releases to the apache announce mailing list:
annou...@apache.org
On Fri Jan 28, 2022 at 1:39 PM CET, Sivabalan wrote:
> The Apache Hudi team is pleased to announce the release of Apache
>
> Hudi 0.10.1.
>
>
> Apache Hudi (pronounced Hoodie) stands for Hadoop
any column, not just the key.
> In another words, we are generalizing this so hudi feels more like MySQL
> and not HBase/Cassandra (key value store). Thats the direction we are
> approaching.
>
> love to hear more feedback.
>
> On Tue, Nov 2, 2021 at 2:29 AM Nicolas Paris
> wro
For example, does the move of blooms into hfiles (a 0.10.0 feature) make
unique bloom keys mandatory?
On Thu Oct 28, 2021 at 7:00 PM CEST, Nicolas Paris wrote:
>
> > Are you asking if there are advantages to allowing duplicates or not having
> > keys in your table?
> it's
o
the hudi:
df_hudi_keys.options(**hudi_options).save(...)
Then a fully featured / documented hoodie client is maybe the best
option, thoughts?
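For context, `hudi_options` above usually has this shape with the datasource writer (table and field names are examples; the keys are standard hoodie write configs):

```python
hudi_options = {
    "hoodie.table.name": "events",                         # example table name
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "date",
    "hoodie.datasource.write.operation": "upsert",
}
# typical use:
#   df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```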
On Thu Oct 28, 2021 at 2:34 PM CEST, Vinoth Chandar wrote:
> Sounds great!
>
> On Tue, Oct 26, 2021 at 7:26 AM Nicolas Paris
> wrote:
>
>
rowse/HUDI-1295
>
> Please let us know if you are interested in testing that when the PR is
> up.
>
> Thanks
> Vinoth
>
> On Tue, Oct 19, 2021 at 4:38 AM Nicolas Paris
> wrote:
>
Hi devs,
AFAIK, hudi has been designed to have primary keys as the hudi key.
However it is possible to also choose a non-unique field. I have listed
several troubles with such a design.
A non-unique key leads to:
- cannot delete / update a unique record
- cannot apply primary key for new sql
hi !
In my use case, for GDPR I have to export all information of a given
user from several HUGE hudi tables. Filtering the tables results in a
full scan of around 10 hours, and this will get worse year after year.
Since the filter criteria is based on the bloom key (user_id) it would
be handy to
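The hoped-for pruning can be simulated with a toy stand-in (a plain set plays each file's bloom; real blooms allow false positives but never false negatives, so this shows the best case):

```python
def files_to_scan(file_blooms, user_id):
    """file_blooms: {file_name: set of keys possibly present}.
    Return only files whose bloom may contain user_id, so the GDPR
    export touches a handful of files instead of the whole table."""
    return [name for name, bloom in file_blooms.items() if user_id in bloom]

# e.g. with blooms {"f1": {"u1", "u2"}, "f2": {"u3"}}, looking up "u3"
# scans only "f2".
```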