Re: Site documentation

2019-01-24 Thread nishith agarwal
I vote for #3 as well. Yes, I'll take up the lead on this. BTW, why is creating a separate branch (asf-site) so popular ? I see that many projects have done that. Thanks, Nishith On Thu, Jan 24, 2019 at 2:13 PM Anbu Cheeralan wrote: > I prefer #3 that will keep the documentation in-sync with

Re: how to merge small parquet files in the hudi location

2019-04-04 Thread nishith agarwal
Rahul, Please make sure you are also setting the following config : "hoodie.cleaner.policy" -> This config supports 2 policies : KEEP_LATEST_FILE_VERSIONS, KEEP_LATEST_COMMITS (This is the default policy) If you are cleaning based on latest file versions, please set the policy to

[IMP] Understanding present state and planning ahead

2019-04-04 Thread nishith agarwal
All, Thanks for being active members of the Hudi community and for using (or contributing) to the project! As the project evolves, it's imperative to understand the key features that will help us all in making Hudi a more usable project. Specific issues may surface up on the mailing lists and/or

Re: Global Index for partitioned Hudi datasets

2019-03-28 Thread nishith agarwal
Here is the HIP : https://docs.google.com/document/d/1RdxVqF60N9yRUH7HZ-s2Y_aYHLHb9xGrlRLK1OWtYKM/edit?usp=sharing @Vinoth Chandar @balaji added you guys as approvers, please take a look. -Nishith On Tue, Mar 26, 2019 at 9:47 PM nishith agarwal wrote: > JIRA : https://issues.apache.org/j

Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal
All, Currently, Hudi supports partitioned and non-partitioned datasets. A partitioned dataset is one which bucketizes groups of files (data) into buckets called partitions. A hudi dataset may be composed of N number of partitions with M number of files. This structure helps canonical

Re: Global Index for partitioned Hudi datasets

2019-03-26 Thread nishith agarwal
JIRA : https://issues.apache.org/jira/projects/HUDI/issues/HUDI-53 -Nishith On Tue, Mar 26, 2019 at 9:21 PM nishith agarwal wrote: > All, > > Currently, Hudi supports partitioned and non-partitioned datasets. A > partitioned dataset is one which bucketizes groups of

Re: Design documentation tools

2019-02-24 Thread nishith agarwal
ound writing a new proposal.. > This is what you were originally trying to do, IIRC? > > On Fri, Feb 22, 2019 at 12:40 PM nishith agarwal > wrote: > > > Great! Thanks Thomas. > > > > Vinoth, > > Can we use our existing hoodie google drive and share it with @dev ?

Re: Design documentation tools

2019-02-28 Thread nishith agarwal
ine the HIP template and the process, > > document , push the site out.. > > Then we can use the GlobalIndex doc to test drive the process. > > Are there other specific things you want to test out on cWiki itself? > > > > On Sun, Feb 24, 2019 at 10:09 AM nishith agarwal

Re: Insert will generate at least one file each time when each spark or spark streaming batch?

2019-02-26 Thread nishith agarwal
Hi Kaka, Hudi automatically does file sizing for you. As you ingest more inserts the existing file will be automatically sized. You can play with a few configs : https://hudi.apache.org/configurations.html#withStorageConfig -> This config allows you to set a max size for your output file.
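The file-sizing configs mentioned in this reply could be set roughly as follows. This is only a sketch: the two keys, `hoodie.parquet.max.file.size` and `hoodie.parquet.small.file.limit`, are assumed to be the storage settings the reply points to, the helper `file_sizing_options` is hypothetical, and the sizes are illustrative rather than recommended values.

```python
MB = 1024 * 1024


def file_sizing_options(max_file_mb, small_file_mb):
    """Build Hudi writer options that bound the size of output files.

    Hypothetical helper for illustration; the sanity check below is an
    assumption of this sketch, not something Hudi is known to enforce.
    """
    if small_file_mb >= max_file_mb:
        raise ValueError("small-file limit must be below the max file size")
    return {
        # Upper bound for a single output parquet file, in bytes.
        "hoodie.parquet.max.file.size": str(max_file_mb * MB),
        # Files below this size are candidates to receive new inserts.
        "hoodie.parquet.small.file.limit": str(small_file_mb * MB),
    }


options = file_sizing_options(max_file_mb=120, small_file_mb=100)
```

The resulting dictionary would be passed along with the other writer options, so that later insert batches are routed into existing small files instead of always creating new ones.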

Re: On MergeOnRead mode, when a record update more than once in a partition, it does not work.

2019-02-26 Thread nishith agarwal
Thanks for pointing that out Kaka, I think HoodieAvroPayload is assigned to be the default class hence the confusion. You could implement your own payload class to achieve this or take a look at

Re: Podling Report Reminder - March 2019

2019-03-04 Thread nishith agarwal
> Please update that page anyways: > > https://svn.apache.org/repos/asf/incubator/public/trunk/content/projects/hudi.xml > > > On Mon, Mar 4, 2019 at 7:09 PM nishith agarwal > wrote: > > > Thomas, > > Do you have any idea why my name doesn't show up on the PPMC

Re: Podling Report Reminder - March 2019

2019-03-04 Thread nishith agarwal
Thomas, Do you have any idea why my name doesn't show up on the PPMC list ? -Nishith On Mon, Mar 4, 2019 at 7:04 PM Thomas Weise wrote: > Great! I made a few minor tweaks and signed off. > > Please also take a look at https://incubator.apache.org/projects/hudi.html > and see if anything needs

Re: Design documentation tools

2019-02-22 Thread nishith agarwal
rom mobile > > > > > > On Mon, Feb 18, 2019, 5:51 PM Vinoth Chandar > wrote: > > > > > > > Bumping this thread up. We have a few big designs upcoming.. > > > > Love to get this going, using this new setup. > > > > > > >

Re: JIRA setup

2019-03-05 Thread nishith agarwal
My JIRA ID : *nishith29* -Nishith On Tue, Mar 5, 2019 at 1:36 PM Vinoth Chandar wrote: > I just started on setting up more things on JIRA.. > > Can PMC members respond with your jira id, so I can setup perms correctly? > > Thanks > Vinoth >

Re: Is HoodieParquetWriter support snappy compression?

2019-03-14 Thread nishith agarwal
I think this was brought up earlier as well +Vinoth Chandar Frank, The HoodieParquetWriter uses gzip compression by default. To enable snappy, we just need to add a config to be able to pass a custom compression format and set that value. I'd be happy to take a PR around this, would you mind

Re: Insert will generate at least one file each time when each spark or spark streaming batch?

2019-03-11 Thread nishith agarwal
lso please use the insert api/operation, (not bulk_insert) if you want > > this behavior. > > > > Let us know if you still run into issues.. > > > > On Tue, Feb 26, 2019 at 11:09 PM kaka chen > wrote: > > > > > Thanks! > > > > > > nishith

Re: Failed to initialize HoodiStorageWriter

2019-03-08 Thread nishith agarwal
spark partitions/tasks. Please > correct me if I am wrong. Thanks. > > On Sat, Mar 9, 2019 at 1:46 AM nishith agarwal > wrote: > > > Umesh, > > > > This issue still persists. Could you please use num-cores = 1 ? You can > > scale out using num-executors. > >

Re: Failed to initialize HoodiStorageWriter

2019-03-08 Thread nishith agarwal
> so as per you I can use 4 executor with one core each processing 4 parquet > files at a time and wasting unnecessarily parallel cores?? You getting me > what I am trying to explain. > > On Sat, Mar 9, 2019, 2:33 AM nishith agarwal wrote: > > > Umesh, > > > >

Re: Failed to initialize HoodiStorageWriter

2019-03-08 Thread nishith agarwal
Umesh, This issue still persists. Could you please use num-cores = 1 ? You can scale out using num-executors. -Nishith On Fri, Mar 8, 2019 at 12:06 PM Umesh Kacha wrote: > I think issue is this https://github.com/uber/hudi/issues/227 I get the > same error and I tried to use multiple executor

Re: Hudi + Alluxio

2019-03-18 Thread nishith agarwal
Brandon, This sounds great! We've also thought around similar lines of hot % vs cold % all accessible through a unified query interface. Let us know how your POC turns out. -Nishith On Mon, Mar 18, 2019 at 1:04 PM Brandon Geise wrote: > Will do on the JIRA. I need to do a POC to see if it's

Request to subscribe/join ML

2019-03-11 Thread nishith agarwal

Re: Site documentation

2019-02-14 Thread nishith agarwal
> > all > >> > >> > >> > > >> > > > it > >> > >> > >> > > >> > > > > > > > should take, right? > >> > >> > >> > > >> > > > > > > > > >> > >> > >> > > >> > > > > > > Not quite, u ne

Re: Design documentation tools

2019-02-13 Thread nishith agarwal
> > > u might want to look at how Kafka KIPs and Flink FLIPs r > done > > > to > > > > > see > > > > > > > how > > > > > > > > > other projects r doing it now and its been working well > >

Re: Name search process

2019-01-30 Thread nishith agarwal
Nice! On Wed, Jan 30, 2019 at 12:06 PM Vinoth Chandar wrote: > Awesome! > > On Wed, Jan 30, 2019 at 11:37 AM Balaji Varadarajan > wrote: > > > > > Hi All, > > > > I have completed the name-search and filed a JIRA here : > > https://issues.apache.org/jira/browse/PODLINGNAMESEARCH-162 > > > >

Design documentation tools

2019-02-03 Thread nishith agarwal
Hi All, I wanted to start a discussion around contributing and collaborating on design for large features. I've noticed that for some of the projects such as apache beam, google docs is chosen as a means to collaborate on design documents and large changes (Find it here

Re: Site documentation

2019-02-03 Thread nishith agarwal
iterate on it and come up with a plan as we migrate the codebase but I'd love to hear your thoughts around this. Thanks, Nishith On Thu, Jan 31, 2019 at 2:38 PM nishith agarwal wrote: > Yes, I'll get to it later tonight. > > Thanks, > Nishith > > On Thu, Jan 31, 2019 at

Re: Last commit id/ts checkpoint for incremental pull

2019-05-15 Thread nishith agarwal
Hey Roshan, The incremental view works on 2 basic principles : 1) You can read updates per batch of data that you commit, so for eg. you have commits c1, c2, c3 & c4. You can incrementally read any range of commits. 2) You can read updates as long as you keep those many versions around. These

Upgrade HUDI to Hive 2.x

2019-05-17 Thread nishith agarwal
All, Is anyone using Hudi with Hive 1.x ? Currently, Hudi has a dependency on Hive 1.x and works against Hive 2.x by using specific profiles. There are non-backwards compatible changes in the HiveRecordReader for Hive 1.x vs Hive 2.x. I'm planning to upgrade to Hive 2.x which would essentially

Re: Read RO table in Spark as hive table | No records returned

2019-05-29 Thread nishith agarwal
Thanks for the details Satish. 1. I am using the Multi partition data - There is an option in HiveSyncTool to use multi-partitioning : --partition-value-extractor which you can use to have a custom partition extractor. 2 As I am using spark ,I can get the partition added by querying on dataset

Re: Schema compatibility

2019-06-25 Thread nishith agarwal
Hi Katie, Thanks for explaining the problem in detail. Could you give us some more information before I can help you with this ? 1. What table type are you using - COPY_ON_WRITE or MERGE_ON_READ ? 2. Could you paste the exception you see in Hudi ? 3. "Despite the schema having full

Re: DISCUSS HUDI-106 Dynamic bloom filters

2019-05-01 Thread nishith agarwal
That's a good pointer. Let me take this up and look into it. -Nishith On Sat, Apr 27, 2019 at 10:55 PM Vinoth Chandar wrote: > > https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/util/bloom/DynamicBloomFilter.html > is > something a team mate pointed to me. > Cannot find anything

Re: Smoothening out skews in bloom checks

2019-05-03 Thread nishith agarwal
context > > On Fri, May 3, 2019 at 9:21 AM nishith agarwal > wrote: > > > Nice, we needed to take a fresher look at the indexing stages, great > start! > > The results look promising. Looks like the Min 24th percentile bumped but > > that's expected si

Re: Out of Heap Error when inserting into Hudi dataset

2019-07-02 Thread nishith agarwal
Kabir, Could you share the content of your commit metadata ? You can list the timeline, find the latest commit in the timeline, perform a cat and paste the results (that you can share). Thanks, Nishith On Tue, Jul 2, 2019 at 4:53 PM Kabeer Ahmed wrote: > Hi Vinoth and other HUDI Experts, > >

Re: [VOTE] Release 0.5.0-incubating, release candidate #1

2019-09-14 Thread nishith agarwal
+1 (binding) Amazing job Balaji! checksums [success] verified signatures [success] mvn clean install -DskipTests [success] Thanks, Nishith On Sat, Sep 14, 2019 at 7:39 PM vbal...@apache.org wrote: > Good point Vinoth. It looks like this is needed for incubator projects. > Let me go ahead and

Re: [DISCUSS] Refactor the package name of Hudi

2019-08-07 Thread nishith agarwal
+1 Efforts are already under way for package renaming, but this has to be done carefully to avoid repeated work and conflicts. Balaji is taking care of this I believe. - Nishith On Wed, Jul 31, 2019 at 5:06 AM Vinoth Chandar wrote: > +1 > > My suggestion would be to follow how it is, when you make

Re: [QUESTION] May I ask if the Hudi contributor JIRA group can receive the notification email.

2019-08-07 Thread nishith agarwal
sounds good to me! On Wed, Aug 7, 2019 at 4:42 AM Vinoth Chandar wrote: > Alright.. May be give this two more days per > https://community.apache.org/committers/lazyConsensus.html ? > Once we deem there is enough support to do this and no one objects, we can > engage on an INFRA ticket > > On

Re: [DISCUSS] Decouple Hudi and Spark

2019-08-06 Thread nishith agarwal
+1 for Approach 1 Point integration with each framework. Pros for point integration - Hudi community is already familiar with spark and spark based actions/shuffles etc. Since both modules can be decoupled, this enables us to have a steady release for Hudi for 1 execution engine (spark) while we

Re: [VOTE] Proposal to clone default JIRA workflow for Hudi project

2019-07-22 Thread nishith agarwal
+1 sounds good. On Mon, Jul 22, 2019 at 12:19 PM Bhavani Sudha Saktheeswaran wrote: > +1 > > On Mon, Jul 22, 2019 at 11:12 AM Vinoth Chandar wrote: > > > Hello all, > > > > Pursuant to > > >

DISCUSS RFC 7 - Point in time queries on Hudi table (Time-Travel)

2019-11-11 Thread nishith agarwal
Folks, Starting a discussion thread for enabling time-travel for Hudi datasets. Please provide feedback on the RFC here . Thanks, Nishith

Re: [Discuss] Convenient time for weekly sync meeting

2019-11-11 Thread nishith agarwal
Vinoth, To meet mid way, how about once in 3 weeks for Europe and other time zones ? That works fine for me. In the interest of making the meetings useful for everyone, we can see how productive the meetings are/% attendance for the meetings for the initial few ones, and then may be we can follow

Re: Write Streaming data using Datasource Writer is not working

2019-10-30 Thread nishith agarwal
nstead of using > DataStreamWriter. I think Datasource Writer also can support write > streaming data, correct? > > Best, > Qian > On Oct 28, 2019, 9:31 PM -0700, nishith agarwal , > wrote: > > Qian, > > > > It seems like you are using the > > > https://spark.

Re: [DISCUSS] Simplification of terminologies

2019-11-12 Thread nishith agarwal
+1 on the first two, don't feel strongly about (3). Thanks, Nishith On Tue, Nov 12, 2019 at 5:03 AM leesf wrote: > [1] +1. `views` indeed confused me a lot. > [2] +1. `snapshot` is more reasonable. > [3] I don't feel very strong to rename it, the current name `COPY_ON_WRITE` > is reasonable

Re: [DISCUSS] New RFC? Hudi dataset snapshotter

2019-11-12 Thread nishith agarwal
+1 on the exporter tool idea. -Nishith On Tue, Nov 12, 2019 at 5:06 AM leesf wrote: > +1. and we would discuss it further when design docs are available. > > Best, > Leesf > > On Tue, Nov 12, 2019 at 4:17 PM Balaji Varadarajan wrote: > > > +1 on the exporter tool idea. > > > > On Mon, Nov 11, 2019 at 10:36

Re: EMR + HUDI

2019-11-15 Thread nishith agarwal
This is great! AWS is one of the most reliable platforms out there and this integration enables more folks use Hudi on such a platform. Thank you all for making this happen. (Hoping to get feedback from all the potential new users to make Hudi even more powerful and useful :)) -Nishith On Fri,

Re: Error while running Hive Sync (hoodie-0.4.7)

2019-10-15 Thread nishith agarwal
Gurudatt, Hudi master moved away from Hive 1.x to Hive 2.x. However, can you try to build master and provide a connection URL to the Hive 1.x metastore/hive server ? Thanks, Nishith On Tue, Oct 15, 2019 at 9:11 AM Vinoth Chandar wrote: > Ouch.. We dropped support for Hive 1.x recently. But

Re: Write Streaming data using Datasource Writer is not working

2019-10-28 Thread nishith agarwal
Qian, It seems like you are using the https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/streaming/DataStreamWriter.html and not the spark DataSource. To use the spark datasource, look at an example here https://hudi.apache.org/writing_data.html#datasource-writer.
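A batch write through the Spark datasource, as opposed to `DataStreamWriter`, might look roughly like this. It is a sketch under assumptions: the table and field names (`sessions`, `guid`, `session_date`, `ts`) and both helper functions are hypothetical, `df` is taken to be an ordinary batch PySpark DataFrame, and the format name `org.apache.hudi` assumes an Apache-era Hudi bundle on the classpath.

```python
def hudi_write_options(table_name, record_key, partition_path, precombine):
    """Collect the core options a batch datasource write needs."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_path,
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.operation": "upsert",
    }


def write_batch(df, base_path, options):
    """Write one batch DataFrame via df.write (DataFrameWriter), not df.writeStream."""
    (df.write
       .format("org.apache.hudi")
       .options(**options)
       .mode("append")
       .save(base_path))


# Hypothetical table and field names, for illustration only.
options = hudi_write_options("sessions", "guid", "session_date", "ts")
```

The point of the distinction is that `df.write` returns a `DataFrameWriter` for a bounded DataFrame, whereas `df.writeStream` returns the `DataStreamWriter` the reply says was being used instead.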

Re: [VOTE] Release 0.5.0-incubating, release candidate #6

2019-10-16 Thread nishith agarwal
+1 (binding) Thanks, Nishith On Wed, Oct 16, 2019 at 8:00 PM vino yang wrote: > +1 (non-binding) > > - checked release note > > Best, > Vino > > On Thu, Oct 17, 2019 at 7:07 AM leesf wrote: > > > +1 (non-binding) > > > > Best, > > Leesf > > >

Re: 20191119 Weekly Meeting

2019-11-19 Thread nishith agarwal
Thanks Bhavani! -Nishith On Tue, Nov 19, 2019 at 10:26 PM Bhavani Sudha wrote: > Please find the meeting summary here - > https://cwiki.apache.org/confluence/x/OxYZC > > Thanks, > Sudha > > On Tue, Nov 19, 2019 at 9:06 PM Vinoth Chandar wrote: > > > Hangout link here > >

Re: Questions about using Hudi

2019-10-08 Thread nishith agarwal
Qian, It looks like the partitionPathField that you specified (session_date) is missing or the code is unable to grab it from your payload. Is this field a top-level field or a nested field in your schema ? ( Currently, the HDFSImporterTool looks for your partitionPathField only at the top-level,

Re: Questions about using Hudi

2019-10-08 Thread nishith agarwal
ta","fields":[ > > {"name":"SESSION_DATE","type":"string"}, > > {"name":"SITE_ID","type":"int"}, > > {"name":"GUID","type":"string"}, > > {"name":"SESSION_KEY","type":"long"}, > > {"name":"USER_ID","type":"string"}, > > {"name":"STEP","type

Re: [VOTE] Release 0.5.0-incubating, release candidate #5

2019-10-06 Thread nishith agarwal
+1 (binding) - verified checksums and signatures [SUCCESS] - verified RAT check [SUCCESS] - built from source release (mvn clean install -DskipTests) [SUCCESS] - ran local docker tests [SUCCESS] Thanks, Nishith On Sat, Oct 5, 2019 at 10:03 PM Bhavani Sudha Saktheeswaran wrote: > +1

Re: Questions about using Hudi

2019-10-11 Thread nishith agarwal
2Fhow-to-validate-contents-of-spark-dataframe=ZGV2QGh1ZGkuYXBhY2hlLm9yZw%3D%3D). > This will show you if there is any value in any column that is against the > declared schema type. And when you fix that, the errors will go away. > > > > Keep us posted on how you get along with this

Re: [QUESTION] Encountering exceptions while upserting with Deltastreamer

2019-12-19 Thread nishith agarwal
Ethan, Unless this is a backwards incompatible schema change, this seems related to a parquet-avro reader bug we've seen before, find more details here : https://issues.apache.org/jira/projects/PARQUET/issues/PARQUET-1656?filter=allopenissues . There's a fix for the parquet-avro reader for 1.8.1

Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-19 Thread nishith agarwal
Great job Lamber! The website looks really slick and has a much better experience of moving from one page to another (mostly I think because it's faster), also find it the text much more conducive to absorb. While going through the quick start, I noticed that under the highlighted box in dark

Re: [QUESTION] Encountering exceptions while upserting with Deltastreamer

2019-12-19 Thread nishith agarwal
build Hudi utilities bundle to > pick that up. > > On Thu, Dec 19, 2019 at 1:51 PM nishith agarwal > wrote: > > > Ethan, > > > > Unless this is a backwards incompatible schema change, this seems > > related to a parquet-avro reader bug we've seen before, find more de

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-24 Thread nishith agarwal
+100 - Reduces index lookup time hence improves job runtime - Paves the way for streaming style ingestion - Eliminates dependency on Hbase (alternate "global index" support at the moment) -Nishith On Mon, Feb 24, 2020 at 10:56 AM Vinoth Chandar wrote: > +1 from me as well. This will be a

Re: Hudi Metrics Template

2020-01-23 Thread nishith agarwal
Syed, I don't think there is one. You can take a look at this class -> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/metrics/HoodieMetrics.java#L138 to find the different metrics being published. Adding a template will be useful! Thanks, Nishith

Re: Re: [DISCUSS] Redraw of hudi data lake architecture diagram on landing page

2020-01-23 Thread nishith agarwal
+1 looks great Nit : I see that the old diagram has "Raw Ingest Tables" vs the new one "Row Ingest Tables". IMO, "Raw Ingest Tables" sounds more logical. -Nishith On Thu, Jan 23, 2020 at 10:57 AM Vinoth Chandar wrote: > +1. on that :) > > On Thu, Jan 23, 2020 at 10:22 AM hmatu

Re: [DISCUSS] Next Apache Release(0.5.2)

2020-02-18 Thread nishith agarwal
+1 on minor release focussing on Apache compliance. +1 on Vino yang to be Release Manager. -Nishith On Tue, Feb 18, 2020 at 11:53 AM vbal...@apache.org wrote: > > +1 on minor release focussing on Apache compliance. > +1 on Vino yang to be Release Manager. > The compliance issues reported on

Re: Please welcome our new PPMCs and Committer

2020-02-14 Thread nishith agarwal
Congratulations folks! Great job, keep it coming! -Nishith On Fri, Feb 14, 2020 at 3:43 PM Shiyan Xu wrote: > Congrats! Very well deserved! > > On Fri, 14 Feb 2020, 13:11 vbal...@apache.org, wrote: > > > Congratulations to Leesf, Vino Yang and Siva. > > +1 Very well deserved :) Looking

Re: [DISCUSS] Delay code freeze date for next release until Jan 19th (Sunday)

2020-01-15 Thread nishith agarwal
+1, sunday sounds good. -Nishith On Wed, Jan 15, 2020 at 9:08 AM Balaji Varadarajan wrote: > +1 Sunday should give breathing space to fix the blockers. > Balaji.V > On Wednesday, January 15, 2020, 06:50:28 AM PST, Vinoth Chandar < > vin...@apache.org> wrote: > > +1 from me. I feel sunday

Re: Re: IDE setup for code formatting

2019-12-23 Thread nishith agarwal
Vinoth, +1 on automating the manual work required at the moment to fix the checkstyle errors. I think if we are able to use spotless and at the same time know upfront all the things that would require manual work, there are few options IMO : a) Have a template of steps that can easily fix it ->

Re: How to upsert on HDFS

2019-12-24 Thread nishith agarwal
Hello Mayu, To query Hudi tables, you need to register it against the correct HudiInputFormat. For more information on querying the hudi table, please read the following documentation : https://hudi.apache.org/querying_data.html. The duplicates might be happening due to the absence of the

Re: [QUESTION] Encountering exceptions while upserting with Deltastreamer

2019-12-24 Thread nishith agarwal
> > > , > > > { "name": "namespace", "type": "string" } > > > , > > > > > > { "name": "version", "type": "long" } > > > ], > > > "name": "master_cluste

Re: Re: Re:Re: Re: Re:Re: Re: Re: [DISCUSS] Rework of new web site

2019-12-24 Thread nishith agarwal
ontent, so we need it. There are several > ways to present it, for examples. > > best, > lamber-ken > > > At 2019-12-20 05:57:16, "nishith agarwal" wrote: > >Great job Lamber! > > > >The website looks really slick and has a much better experience of mov

Re: [QUESTION] Encountering exceptions while upserting with Deltastreamer

2019-12-24 Thread nishith agarwal
sounds good! -Nishith On Tue, Dec 24, 2019 at 11:42 PM Sivabalan wrote: > I plan to help Ethan investigate this issue. Will keep you posted. > > On Tue, Dec 24, 2019 at 11:30 PM nishith agarwal > wrote: > > > Kabir, > > > > Here is a good resource to qu

Re: Re: How to upsert on HDFS

2019-12-24 Thread nishith agarwal
...@bonc.com.cn wrote: > Thank you for answering my question > I found that this is not a problem with HDFS, but that the number of > records can be upsert when 1 million, and it will have this problem when 10 > million > > > > ma...@bonc.com.cn > > From: nishith agarwal >

Re: [DISCUSS] Insert Overwrite with snapshot isolation

2020-04-21 Thread nishith agarwal
+1, thanks for starting this effort Satish! -Nishith On Fri, Apr 17, 2020 at 2:26 PM Vinoth Chandar wrote: > Thanks Satish! > > On Fri, Apr 17, 2020 at 11:32 AM Satish Kotha > > wrote: > > > Thanks for interesting discussion. I will start RFC as suggested and > > discuss points brought up in

Re: [DISCUSS] enable cross AZ consistency and quality checks of hudi datasets

2020-09-10 Thread nishith agarwal
+1 on the proposal. Thanks Sanjay for describing this in detail. This feature can also help in eliminating file listing completely from HDFS for hudi metadata information for use-cases that are very sensitive to file listing. Thanks, Nishith On Wed, Sep 9, 2020 at 4:59 PM Vinoth Chandar wrote:

Re: [DISCUSS] New Community Weekly Sync up Time

2020-09-15 Thread nishith agarwal
The current time suits well for me personally as well. But I'm fine with 8-9 pm if that helps accommodate other folks. Thanks, Nishith On Tue, Sep 15, 2020 at 7:29 AM Bhavani Sudha wrote: > The current time suited well for me personally. > Moving that to 1 hour earlier should be okay mostly. I

Re: [DISCUSS] Planning for Releases 0.6.1 and 0.7.0

2020-09-24 Thread nishith agarwal
Yes, we have some ideas around schema evolution and have discussed with Balaji before as well. I'm going to put these thoughts down and share it on the cWiki for all of us to jam. Realistically, I don't think we can hit in 0.7.0. We already have a pretty strong list of items for 0.7.0. Spark 3

Re: Hudi Concurrent Ingestion with Spark Streaming

2020-09-17 Thread nishith agarwal
Great! -Nishith On Thu, Sep 17, 2020 at 10:28 AM tanu dua wrote: > Thank you so much Nisheth. I understand now how it’s going to work. > > On Wed, 16 Sep 2020 at 11:15 PM, nishith agarwal > wrote: > > > Tanu, > > > > > > > > I'm assuming you're

Re: Hudi Concurrent Ingestion with Spark Streaming

2020-09-16 Thread nishith agarwal
Tanu, I'm assuming you're talking about multiple kafka partitions from a single Spark Streaming job. In this case, your job can read from multiple partitions but at the end, this data should be written to a single table. The dataset/rdd resulting from reading multiple partitions is passed as a

Re: PSA: master integ-tests failing

2020-08-03 Thread nishith agarwal
ed > by > > skipTests property. > > had a fix for changing to skipUTs > > https://github.com/apache/hudi/pull/1897/files > > > > On Fri, Jul 31, 2020 at 8:21 PM nishith agarwal > > wrote: > > > > > All, > > > > > > I've added new

Re: PSA: master integ-tests failing

2020-07-31 Thread nishith agarwal
All, I've added new log4j properties to the docker setup to limit the spark logs from the spark driver. Master should be stable. One thing I noticed during this is that the class `HoodieSparkSqlWriter` also runs as part of the integration tests which it should not, thus adding to the logs.

Re: [DISCUSS] Organizing ourselves for scale

2020-07-14 Thread nishith agarwal
+1 on high level roles as well as spreading PMCs to those roles. Going forward, it will be good to have PMC members overseeing different aspects of the community to help guide and provide feedback in a timely manner without overwhelming 1 person. Thanks, Nishith On Tue, Jul 14, 2020 at 9:02 AM

Re: [ANNOUNCE] Apache Hudi 0.5.3 released

2020-06-17 Thread nishith agarwal
Great job Siva and Sudha, thanks for driving this! -Nishith On Wed, Jun 17, 2020 at 7:16 PM wrote: > Super news :) The very first release after graduation. Awesome job Siva > and Sudha for spearheading the release of 0.5.3. > Balaji.V > > Sent from Yahoo Mail for iPhone > > > On Wednesday,

Re: [VOTE] Release 0.5.3, release candidate #2

2020-06-11 Thread nishith agarwal
+1 (binding) - Ran tests locally - Release script successful Checking Signature Signature Check - [OK] Checking for binary files in source release No Binary Files in Source Release? - [OK] Checking for DISCLAIMER DISCLAIMER file exists ? [OK] Checking for LICENSE and NOTICE License

Re: [DISCUSS] Hyperspace + Hudi

2020-07-27 Thread nishith agarwal
Thanks Vinoth for kicking off this thread. I have also been looking into hyperspace and it is definitely an interesting project. On exploring the project, I found the following in addition to what you mentioned - Super easy to use, has a simple API to integrate into a spark based application -

Re: [DISCUSS] Hyperspace + Hudi

2020-07-27 Thread nishith agarwal
> indexing optimizations based on the created index > > This is very interesting. Could you expand more? One day, love to support > point(ish) lookups on. Hudi tables :) > > On Mon, Jul 27, 2020 at 8:29 AM nishith agarwal > wrote: > > > Thanks Vinoth for kicking off thi

Re: [DISCUSS] 0.7.0 release timelines

2020-12-01 Thread nishith agarwal
I vote for option 2 as well. -Nishith On Tue, Dec 1, 2020 at 10:05 PM Bhavani Sudha wrote: > I vote for option 2 too. > > On Tue, Dec 1, 2020 at 7:36 PM Sivabalan wrote: > > > I would vote for Option2 given that features are already being tested. if > > it's half way through development, may

Re: Accomplishments and Roadmap.

2020-12-14 Thread nishith agarwal
+1 I like the idea of putting a "2020 journey" blog and sharing it in an online Apache Hudi Tech Talk session with the community. Additionally, if anyone is interested in showcasing a production use-case they've been running, I'm going to host a meetup soon, please reach out to me. Thanks,

[DISCUSS] Parallel writing to Hudi tables

2020-12-13 Thread nishith agarwal
Folks, There have been many requests from users around supporting concurrency control for Hudi tables. I'm proposing we break down this ask into 2 phases. The first phase will focus on providing the ability to perform parallel writes to Hudi tables - this means as long as writes touch

Re: Reg weekly sync meeting

2020-11-02 Thread nishith agarwal
+1 On Mon, Nov 2, 2020 at 9:05 AM Sivabalan wrote: > +1 > > On Mon, Nov 2, 2020 at 11:57 AM Vinoth Chandar wrote: > > > +1 > > > > On Mon, Nov 2, 2020 at 8:44 AM Balaji Varadarajan > > wrote: > > > > > +1 > > > On Sunday, November 1, 2020, 09:13:44 PM PST, Gary Li < > > >

Re: [VOTE] Release 0.7.0, release candidate #2

2021-01-22 Thread nishith agarwal
+1 binding - Build Successful - Release validation script Successful - Quick start runs Successfully Checking Checksum of Source Release Checksum Check of Source Release - [OK] % Total% Received % Xferd Average Speed TimeTime Time Current Dload

Re: [VOTE] Release 0.7.0, release candidate #1

2021-01-21 Thread nishith agarwal
+1 binding - Build Successful - Release validation script Successful - Quick start runs Successfully ./release/validate_staged_release.sh --release=0.7.0 --rc_num=1 /tmp/validation_scratch_dir_001 ~/hoodie-0.7/hudi/scripts Downloading from svn co https://dist.apache.org/repos/dist//dev/hudi

Re: Congrats to our newest committers!

2021-01-27 Thread nishith agarwal
Congratulations to both! -Nishith On Wed, Jan 27, 2021 at 11:49 AM Sivabalan wrote: > Congratulations folks ! > > On Wed, Jan 27, 2021 at 12:48 PM Pratyaksh Sharma > wrote: > > > Congratulations both of you! > > > > On Wed, Jan 27, 2021 at 8:43 PM Vinoth Chandar > wrote: > > > > > Congrats

Re: support for out of order processing

2021-01-30 Thread nishith agarwal
Anton, Yes, you can achieve this with Hudi. Hudi uses a HoodieRecordPayload implementation to be able to merge old and new records. You can define a source ordering field (here "sort_key") to govern which record should be chosen as the latest one. The DefaultHoodieRecordPayload supports this ->

Technical Issues with Bi-Weekly Meeting

2021-01-26 Thread nishith agarwal
Folks, We had some technical issues with the bi-weekly meeting invite. Please join the google hangout link below : https://hangouts.google.com/call/XXLOwp6NWMvx-4GXLRIfAEEE This is also available in the community bi-weekly link here ->

Re: User support issues

2021-02-02 Thread nishith agarwal
Thanks for doing this triaging Siva. This will help pick usability issues that don't get surfaced. I'll assign few to myself. -Nishith On Tue, Feb 2, 2021 at 8:50 AM Vinoth Chandar wrote: > Thanks for champion efforts here, to pull this list Siva. > > Can we also add a line to the contributing

Re: [DISCUSS] Refactor the Hudi configuration framework

2021-04-21 Thread nishith agarwal
+1 from me as well. This will help in maintainability immensely. -Nishith On Mon, Apr 19, 2021 at 2:06 PM Vinoth Chandar wrote: > Biggest difference from PR 1094 and the current PR open, is the addition of > fallback support and that no moving around of configs in the same PR. > This would

Re: [0.8.0 RELEASE] Codebase freeze for 0.8.0 release

2021-03-23 Thread nishith agarwal
Looks like Travis is back :) Will cut the release branch after > https://github.com/apache/hudi/pull/2708 merged. > > On Wed, Mar 24, 2021 at 3:53 AM nishith agarwal > wrote: > > > Gary, > > > > Do you have the link to the support ticket ? I can reach

Re: [0.8.0 RELEASE] Codebase freeze for 0.8.0 release

2021-03-22 Thread nishith agarwal
t, > > Gary Li > > > > On Mon, Mar 22, 2021 at 5:36 PM Danny Chan wrote: > > > > > Hi, everyone ~ > > > > > > I have post a PR today https://github.com/apache/hudi/pull/2702, and i > > > want > > > it to be in 0.8.0, > > >

Re: [Dev X] Azure Pipelines for CI

2021-03-28 Thread nishith agarwal
That's nice Raymond! This is a great step towards having nightly builds. It will help us build confidence in regular commits and avoid any surprises closer to releases as well. Thanks, Nishith On Sun, Mar 28, 2021 at 7:25 PM vino yang wrote: > Great job! Raymond. It will be very helpful to the

Re: [VOTE] Release 0.8.0, release candidate #1

2021-03-31 Thread nishith agarwal
+1 binding 1. Compilation [OK] 2. Quick start (Spark 2.x, 3.x) [OK] 3. Signature [OK] Thanks, Nishith On Wed, Mar 31, 2021 at 8:35 AM vino yang wrote: > +1 binding > > - ran `mvn clean package -DskipTests` [OK] > - quick start (Spark 2.x, 3.x) [OK] > - checked signature [OK] > > Best, > Vino

Re: [0.8.0 RELEASE] Codebase freeze for 0.8.0 release

2021-03-23 Thread nishith agarwal
> last > > > > > commit. Travis seems to have some issues cause no build is running > at > > > the > > > > > moment. > > > > > > > > > > Best, > > > > > Gary Li > > > > > > > > > >

Re: [0.8.0 RELEASE] Codebase freeze for 0.8.0 release

2021-03-22 Thread nishith agarwal
Gary/Siva, All the release blockers have been landed from my side. I took a quick look and I don't see any other release blockers at the moment. I think we can freeze the code base and start the release process. Thanks, Nishith On Sat, Mar 20, 2021 at 8:08 AM Gary Li wrote: > Thanks Siva, I

Re: 0.8.0 Release discussion

2021-03-02 Thread nishith agarwal
+1, exciting to see the progress on Flink. -Nishith On Tue, Mar 2, 2021 at 5:40 AM leesf wrote: > +1 to release monthly if possible, and thanks Danny for the great work on > Flink. > > On Tue, Mar 2, 2021 at 10:30 AM Vinoth Chandar wrote: > > > +1 > > > > There are two more PRs to land for multi writers,

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-03 Thread nishith agarwal
This is a good idea @vino yang Have you looked into what the "automated code review" actually does ? -Nishith On Wed, Mar 3, 2021 at 7:38 AM vino yang wrote: > Hi guys, > > I want to introduce a code analysis service called lgtm[1] in the > community. Recently, in the Kylin community, I

Re: [DISCUSS] Introduce lgtm to analyze the changes of PR and simplify the cost of code review

2021-03-04 Thread nishith agarwal
e submits > code.”* > > From the official website, you can see that it supports mainstream > programming languages: C/C++, C#, Go, Java, JavaScript, Python. > > I speculate that maybe it integrates some bug static scanning tools. > > Best, > Vino > > nishith agarwal
