Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks. After reading the discussion in HUDI-561, I just realized that the previously mentioned built-in partition transformer is better suited to a custom key generator. Hopefully other suitable ideas for built-in transformers will come up later. On Sun, Feb 23, 2020 at 6:34 PM vino yang wrote:

Multiple clean instants with same timestamp

2020-02-23 Thread Pratyaksh Sharma
Hi, I recently came across a strange issue for table T. For the same timestamp, two clean instants were present in the .hoodie folder, one in completed state and the other in inflight state. As a result, when I tried to run the cleaner or DeltaStreamer for this table T, it failed with the below

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread vino yang
Hi Shiyan, Really sorry, I forgot to attach the reference. The relevant Jira ID is HUDI-561: https://issues.apache.org/jira/browse/HUDI-561 It seems both of you faced the same issue, although the solutions are not the same. Never mind, you can move the discussion to that issue. Best, Vino Shiyan

Bring back support for spark 2.3?

2020-02-23 Thread Pratyaksh Sharma
Hi, As discussed in the weekly sync two weeks ago, I want to raise this point on our mailing list as well. Since the 0.5.1 release upgraded Spark to 2.4 in our master branch, we are facing difficulties after rebasing our codebase onto master. At our organisation we are using Spark

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Late to the party. :P I really favor the idea of enriching built-in support. It is a very common case that we want to use datetime fields for the partition path. We could have built-in support for normalizing ISO format / unix timestamps. For example `HourlyPartitionTransformer` will normalize
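
A minimal sketch of the kind of normalization proposed in this message, written in Python for brevity. The function name `hourly_partition_path`, the `yyyy/MM/dd/HH` path layout, the use of seconds for unix timestamps, and the choice to treat timezone-less values as UTC are all assumptions made for illustration; none of this is Hudi's actual transformer or key generator API.

```python
from datetime import datetime, timezone
from typing import Union


def hourly_partition_path(value: Union[str, int, float]) -> str:
    """Map an ISO-8601 string or a unix timestamp (seconds) to an hourly path.

    Hypothetical helper: the output layout yyyy/MM/dd/HH (in UTC) is an
    assumption, not something specified in the thread.
    """
    if isinstance(value, (int, float)):
        # Unix timestamp, assumed to be in seconds since the epoch.
        dt = datetime.fromtimestamp(value, tz=timezone.utc)
    else:
        text = value.strip()
        # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
        # so rewrite it as an explicit UTC offset.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        dt = datetime.fromisoformat(text)
        if dt.tzinfo is None:
            # Assumption: values without an offset are already in UTC.
            dt = dt.replace(tzinfo=timezone.utc)
        dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y/%m/%d/%H")


if __name__ == "__main__":
    print(hourly_partition_path("2020-02-23T18:34:00Z"))
    print(hourly_partition_path(1582482840))
```

Both input styles collapse to the same path when they denote the same instant, which is the point of normalizing before the partition path is derived.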

Need clarity on these test cases in TestHoodieDeltaStreamer

2020-02-23 Thread Pratyaksh Sharma
Hi, While working on one of my PRs, I am stuck with the following test cases in TestHoodieDeltaStreamer - 1. testUpsertsCOWContinuousMode 2. testUpsertsMORContinuousMode For both of them, at line [1] and [2], we are adding 200 to totalRecords while asserting record count and distance count

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread vino yang
Hi Shiyan, Thanks for raising this thread again and sharing your thoughts. They are valuable. Regarding the date-time specific transform, there is an issue[1] that describes this business requirement. Best, Vino On Mon, Feb 24, 2020 at 7:22 AM Shiyan Xu wrote: > Late to the party. :P > > I really favor

Re: [DISCUSS] RFC - 08 : Record level indexing mechanisms for Hudi datasets

2020-02-23 Thread vino yang
Hi Sivabalan, Thanks for your proposal. Big +1 from my side; record-level indexing is really good for performance. It is also a step towards stream processing. Best, Vino On Sun, Feb 23, 2020 at 12:52 AM Sivabalan wrote: > As Apache Hudi is getting widely adopted, performance has become the need

Re: Refactor and enhance Hudi Transformer

2020-02-23 Thread Shiyan Xu
Thanks Vino. Are you referring to HUDI-613? How about making it an umbrella task due to its big scope? (btw it is stated as "bug", which should be fixed too). I can create another specific task under it for the idea of datetime -> partition path transformer, if it makes sense. On Sun, Feb 23,