Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-17 Thread Satish Kotha
l be retained, but instead of merging, you will just reuse > > the file names and write the incoming records into new file slices? > > You probably already thought of this, but one thing to watch out for is: > > we should generate a new file slice for every file group in a partition..

[DISCUSS] Insert Overwrite with snapshot isolation

2020-04-15 Thread Satish Kotha
Hello I want to discuss adding a new high level API 'insertOverwrite' on HoodieWriteClient. This API can be used to - Overwrite specific partitions with new records - Example: partition has 'x' records. If insert overwrite is done with 'y' records on that partition, the
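
To illustrate the semantics described above, here is a toy, in-memory sketch of overwriting specific partitions. It is not Hudi code: the function `insert_overwrite`, the dict-of-lists table model, and the partition names are all hypothetical. It assumes that an overwritten partition ends up holding only the incoming records and that partitions not touched by the write are left alone.

```python
from collections import defaultdict


def insert_overwrite(table, incoming):
    """Toy model of insert overwrite (hypothetical helper, not a Hudi API).

    `table` maps a partition path to its list of records.
    `incoming` is a list of (partition path, record) pairs.
    Every partition named in `incoming` is replaced wholesale;
    all other partitions are copied through unchanged.
    """
    replacement = defaultdict(list)
    for partition, record in incoming:
        replacement[partition].append(record)

    # Copy the existing table so the caller's data is not mutated.
    result = {partition: list(records) for partition, records in table.items()}
    # Swap in the new contents for each overwritten partition.
    result.update(replacement)
    return result


# The partition "2020/04/14" starts with 'x' records and is overwritten with 'y' records.
table = {"2020/04/14": ["x1", "x2", "x3"], "2020/04/15": ["z1"]}
after = insert_overwrite(table, [("2020/04/14", "y1"), ("2020/04/14", "y2")])
```

In this model `after["2020/04/14"]` contains only the two incoming records, while `"2020/04/15"` is unchanged.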

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-05-08 Thread Satish Kotha
i, Apr 17, 2020 at 2:26 PM Vinoth Chandar wrote: > > > Thanks Satish! > > > > On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha > > > > > wrote: > > > > > Thanks for interesting discussion. I will start RFC as suggested and > > > discuss poi

Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-08 Thread Satish Kotha
Hi folks, Any thoughts on this? At a high level, we want to change high watermark commit through a property to perform pre-commit and post-commit hooks. Is this useful for anyone else? On Thu, Sep 3, 2020 at 11:12 AM Sanjay Sundaresan wrote: > Hello folks, > > We have a use case to make sure

Re: Query Incremental Updates on same primary key

2020-05-29 Thread Satish Kotha
Hi, > Now, when I query I get (1 | Mickey) but I never get (1 | Tom) as it's in > old parquet file. So doesn't incremental query run on old parquet files? > Could you share the command you are using for incremental query? Specific config is required by hoodie for doing incremental queries.
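
As a rough sketch of the kind of config meant here, the helper below builds the read options for an incremental query. The function name `incremental_read_options` and the instant value are made up for illustration. The option keys are the Spark DataSource read options as I understand them from recent Hudi releases; older releases may use a different key for the query type, so please check the docs for your version.

```python
def incremental_read_options(begin_instant, end_instant=None):
    """Build Hudi DataSource read options for an incremental query.

    `begin_instant` / `end_instant` are commit timestamps on the table's
    timeline, given as strings (for example "20200529000000").
    """
    options = {
        # Ask for an incremental read instead of the default snapshot read.
        "hoodie.datasource.query.type": "incremental",
        # Return only records written by commits after this instant.
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        # Optional upper bound on the commit range.
        options["hoodie.datasource.read.end.instanttime"] = end_instant
    return options


opts = incremental_read_options("20200529000000")

# Usage with Spark (needs the Hudi bundle on the classpath; `spark` and
# `base_path` are assumed to exist in your session):
#   df = spark.read.format("hudi").options(**opts).load(base_path)
```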

Re: Query Incremental Updates on same primary key

2020-05-29 Thread Satish Kotha
ut I can try again if you believe that incremental query scans through all > the parquet files and not just the latest one. > > On Fri, 29 May 2020 at 10:48 PM, Satish Kotha > > wrote: > > > Hi, > > > > > > > Now, when I query I get (1 | Mickey) but I never get (1 | To

[DISCUSS] querying commit metadata from spark DataSource

2020-05-29 Thread Satish Kotha
Hello folks, We have a use case to incrementally generate data for a hudi table (say 'table2') by transforming data from another hudi table (say, table1). We want to atomically store commit timestamps read from table1 into table2 commit metadata. This is similar to how DeltaStreamer operates with

Re: Query Incremental Updates on same primary key

2020-05-30 Thread Satish Kotha
ll. > > I tried to find out in wiki but couldn't get much info so if you have some > link please provide it. > Thanks for all your help !! > > > > On 2020/05/29 18:55:39, Satish Kotha > wrote: > > Hello > > > > But I can try again if you believe that in

Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Satish Kotha
source name - > > spark.read.format("hudi-timeline").load(basepath). We can start by just > > allowing querying of active timeline and expand this to archive timeline? > > > > What do others think? > > > > > > > > > > On Fri, May 29, 2020 at 2:37 PM

Re: [DISCUSS] querying commit metadata from spark DataSource

2020-06-01 Thread Satish Kotha
re accessed from > That's the public API for obtaining this information for Scala/Java Spark. > If you have a way of calling this from python through some bridge without > painful bridges (e.g jython), might be a tactical solution that can meet > your needs. > > On Mon, Jun 1, 2020 at 5:

Re: Deleting Hudi Partitons

2020-10-21 Thread Satish Kotha
Yes, that would work. You would typically add the option below on the dataframe to use insert overwrite (InsertOverwrite is a new API; I haven't updated the documentation yet). - hoodie.datasource.write.operation: insert_overwrite Let me know if you have any questions. @Balaji Thanks for creating the
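
A minimal sketch of how that option might be set from PySpark follows. The helper `insert_overwrite_options` and the table and field names passed to it are hypothetical; only `hoodie.datasource.write.operation: insert_overwrite` comes from the message above, and the other keys are the usual Hudi DataSource write options that a write generally needs alongside it.

```python
def insert_overwrite_options(table_name, record_key, partition_field, precombine_field):
    """Build Hudi DataSource write options for an insert overwrite.

    All argument values are placeholders chosen by the caller; this
    helper is illustrative and not part of Hudi.
    """
    return {
        "hoodie.table.name": table_name,
        # The option from the message above: replace the partitions being written.
        "hoodie.datasource.write.operation": "insert_overwrite",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


opts = insert_overwrite_options("trips", "uuid", "partitionpath", "ts")

# Usage with Spark (needs the Hudi bundle; `df` and `base_path` are assumed):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
# Save mode "append" is used here so that only the partitions present in `df`
# are replaced, rather than the whole table path being wiped.
```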

[Announce] Clustering feature available in beta

2020-12-22 Thread Satish Kotha
Hello all, The clustering feature has landed on the master branch and is available in beta. This feature can be used to do the following: 1) Stitch small files into larger files 2) Change data layout on disk by sorting data using different columns (for query/storage

Re: [Announce] Clustering feature available in beta

2021-01-20 Thread Satish Kotha
nks Satish On Tue, Dec 22, 2020 at 10:32 PM Vinoth Chandar wrote: > Please help us test this more, before RC is cut! :) > > On Tue, Dec 22, 2020 at 10:23 PM Satish Kotha > > wrote: > > > Hello all, > > > > Clustering feature landed <https://github.com/apa

Re: [VOTE] Release 0.7.0, release candidate #1

2021-01-21 Thread Satish Kotha
+1, 1) Able to build 2) Integration tests pass 3) Unit tests pass locally 4) Successfully ran clustering on a small dataset (metadata table not enabled) 5) Verified insert, upsert, insert_overwrite works using QuickStart commands on COW table (metadata table not enabled) On Thu, Jan 21, 2021

Re: [DISCUSS] Improve data locality during ingestion

2021-02-03 Thread Satish Kotha
are co-located in the same data/log file. Hopefully, this explains the idea better. Appreciate any feedback. On Mon, Feb 1, 2021 at 3:43 PM Satish Kotha wrote: > Hello, > > Clustering <https://hudi.apache.org/blog/hudi-clustering-intro/> is a > great feature for improving d

Re: [DISCUSS] Hash Index for HUDI

2021-06-02 Thread Satish Kotha
+1. You may want to read this thread as well. There are minor differences between these threads, but the high level idea is similar. On Wed, Jun 2, 2021

[DISCUSS] Improve data locality during ingestion

2021-02-01 Thread Satish Kotha
Hello, Clustering is a great feature for improving data locality. But it has a (relatively big) cost to rewrite the data after ingestion. I think there are other ways to improve data locality during ingestion. For example, we can add a new

Re: [DISCUSS] Improve data locality during ingestion

2021-02-18 Thread Satish Kotha
> > > > > > Today the upsert partitioner does the file sizing/bin-packing etc > for > > > > > inserts and then sends some inserts over to existing file groups to > > > > > maintain file size. > > > > > We can abstract all of this into strategies and some

Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-22 Thread Satish Kotha
> https://github.com/apache/hudi/commit/c288a506d4c0b7c1272538d95928df118e4d79ac > > https://github.com/apache/hudi/commit/211af1a4fd76ce84ce80f4d1b2befe5fc9954888 > > Best, > Danny > > On Tue, Dec 20, 2022 at 11:50, Satish Kotha wrote: > > > > small correction in the first line:

[RESULT] [VOTE] Release 0.12.2, release candidate #1

2022-12-24 Thread Satish Kotha
Hi everyone, I'm happy to announce that we have unanimously approved this release. There are 7 approving votes, 4 of which are binding. Here is the breakdown: +1 (binding) : 4 * Bhavani Sudha Saktheeswaran * Sivabalan Narayanan * Shiyan Xu * Balaji Varadarajan -1 (binding) : 0 +1

Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-24 Thread Satish Kotha
> > > > > > > > > > > Checking for DISCLAIMER > > > > > > > > > > DISCLAIMER file exists ? [OK] > > > > > > > > > > > > > > > Checking for LICENSE and NOTICE > > > > > > > > >

Apache Hudi 0.12.2 released

2022-12-27 Thread Satish Kotha
The Apache Hudi team is pleased to announce the release of Apache Hudi 0.12.2. Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and Incrementals. Apache Hudi manages storage of large analytical datasets on DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage) and

0.12.2 release code freeze date

2022-12-06 Thread Satish Kotha
Hi folks, We have a new proposal for the 0.12.2 release code freeze date. I'm proposing this Friday, 12/9. Please reply back if you have any objections or if you are aware of any other blockers. Thanks Satish

Re: [VOTE] Release 0.12.2, release candidate #1

2022-12-19 Thread Satish Kotha
small correction in the first line: Please review and vote on the release candidate #1 for the version 0.12.2, On Mon, Dec 19, 2022 at 6:37 PM Satish Kotha wrote: > Hi everyone, > > Please review and vote on the release candidate #1 for the version 0.12.1, > as follows: > >

[VOTE] Release 0.12.2, release candidate #1

2022-12-19 Thread Satish Kotha
Hi everyone, Please review and vote on the release candidate #1 for the version 0.12.1, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes

Re: 0.12.2 release code freeze date

2022-12-08 Thread Satish Kotha
grade for 1.16.0 so unsure if dec 9th > > is possible for me to have this fully merged. > > > > Regards, > > Rahil > > > > > > On Tue, Dec 6, 2022 at 12:12 PM Satish Kotha > > wrote: > > > >> Hi folks, > >> > >>

Re: Calling for 0.12.2 RM

2022-11-17 Thread Satish Kotha
I am happy to take this up. On Thu, Nov 17, 2022 at 10:57 AM Sivabalan wrote: > Hey folks, > It's time for us to get started w/ 0.12.2 bug fix release. We can > target for code freeze by end of Nov and release by early Dec. Can we have > volunteers for the release manager to drive the

0.12.2 release timeline

2022-11-19 Thread Satish Kotha
Hi folks, As the RM for the upcoming 0.12.2 release, I'd like to propose code freeze for 0.12.2 on Monday, Nov 28, 11:59 PM PST. Please raise any concerns you have by the end of day Nov 21. Please tag any JIRAs that you have planned for the 0.12.2 release by setting the "Fix Version/s" to