Re: [I] [to be discussed] Support for rollbackFailedWrites to delete inactive clustering plans [hudi]
nsivabalan commented on issue #17879: URL: https://github.com/apache/hudi/issues/17879#issuecomment-3881384507

Gotcha. To summarize, here are the different requirements we have around table service orchestration and deployment:

**Compaction**
- R1: Only one TS writer should be able to execute a given compaction plan.
  -> Emit heartbeats during execution.

**Clustering**
- R1: Only one TS writer should be able to execute a given clustering plan.
  -> Emit heartbeats during execution.
- R2: Only one TS writer should be able to perform both planning and execution. If execution fails after planning for any reason, the plan should be cleaned up by some TS writer after some threshold.
  -> A concurrent writer or TS planner could hit a FileNotFound issue if not for an abort state.
  -> Introduce an abort state, or go with inflight deletion/nuking of plans.

Let me know if this is right.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
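The heartbeat-based exclusivity in R1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Hudi heartbeat implementation: the class and method names (`ExecutionGuard`, `tryAcquire`) are hypothetical, and a real deployment would persist heartbeats on storage rather than in memory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch (not the actual Hudi API): a TS writer may execute a
// plan only while it holds a live heartbeat on that instant; a second writer
// that sees a live heartbeat backs off.
public class ExecutionGuard {
  // Last heartbeat time per instant; a real system would persist these
  // (e.g., as timestamped files on storage) rather than keep them in memory.
  private final Map<String, Long> heartbeats = new HashMap<>();
  private final long expiryMs;

  public ExecutionGuard(long expiryMs) {
    this.expiryMs = expiryMs;
  }

  // Try to claim the plan for execution: succeeds only if no live
  // heartbeat exists (none yet, or the previous one has expired).
  public synchronized boolean tryAcquire(String instant, long nowMs) {
    Long last = heartbeats.get(instant);
    if (last != null && (nowMs - last) <= expiryMs) {
      return false; // another writer is actively executing this plan
    }
    heartbeats.put(instant, nowMs);
    return true;
  }

  // The owning writer refreshes its heartbeat periodically during execution.
  public synchronized void heartbeat(String instant, long nowMs) {
    heartbeats.put(instant, nowMs);
  }
}
```

An expired heartbeat doubles as the "after some threshold" trigger in R2: once it lapses, another TS writer is free to take over or clean up the plan.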
kbuci commented on issue #17879: URL: https://github.com/apache/hudi/issues/17879#issuecomment-3776026228

Sorry, let me clarify. Our goals are:
- Have a way for clean to automatically roll back (and delete the plans of) clustering writes that have failed.
- A new utility API that takes a list of partitions and attempts a rollback (and plan deletion) of any failed clustering writes targeting the same partitions.

Normally this would be unsafe (without implementing the RFC) due to the multi-writer cases you mentioned. But we are able to implement this cleanup logic internally since, for us:
- Clustering is only attempted by a dedicated table service writer job that directly calls the clustering APIs. In addition, we ensure that scheduling and execution happen in the same job.
- We don't enable clustering "inline" within the write commit / deltastreamer.

Because of this, in our internal build we can just start a heartbeat in the schedule-clustering call. And, similar to ingestion writes, once the heartbeat has expired, Hudi can assume that the plan is safe to roll back and delete (either by clean or by the aforementioned utility API).

> Are you asking for the proposed RFC to be implemented.

No, we don't have to implement the RFC for this, although that would be an ideal long-term solution to align our clustering flow with the existing OSS flow.

> You are looking for support to rollback and nuke an existing clustering plan(already scheduled). This should not be a big ask, assuming you can enable the config for just 1 of the dedicated table service writer and whenever it detects a pending clustering plan in the timeline, it could rollback and nuke the plan.

Yes, though both the clustering and clean jobs should have this config enabled. Alternatively, this could be made a table-level config.

> But chances that another concurrent ingestion writer could result in file not found issue which needs to be tackled.
Yes, that's right; we have an internal fix we can upstream for this.
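The cleanup flow described above (heartbeat started at scheduling, kept alive through execution, and clean rolling back any pending plan whose heartbeat has expired) can be sketched like this. All names here (`ClusteringPlanJanitor`, `findPlansSafeToRollback`) are illustrative assumptions, not actual Hudi classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the flow described above: the dedicated table
// service job starts a heartbeat when it schedules a clustering plan and
// keeps it alive through execution. A later clean (or the proposed utility
// API) treats any pending plan whose heartbeat has expired as failed, rolls
// it back, and deletes the plan.
public class ClusteringPlanJanitor {
  private final Map<String, Long> lastHeartbeatMs = new HashMap<>();
  private final long expiryThresholdMs;

  public ClusteringPlanJanitor(long expiryThresholdMs) {
    this.expiryThresholdMs = expiryThresholdMs;
  }

  // Called by the scheduling/executing job to keep the plan "owned".
  public void emitHeartbeat(String instant, long nowMs) {
    lastHeartbeatMs.put(instant, nowMs);
  }

  // Returns the subset of pending plans eligible for rollback + deletion:
  // those with no heartbeat at all, or whose heartbeat has expired.
  public List<String> findPlansSafeToRollback(List<String> pendingPlans, long nowMs) {
    List<String> safe = new ArrayList<>();
    for (String instant : pendingPlans) {
      Long last = lastHeartbeatMs.get(instant);
      if (last == null || (nowMs - last) > expiryThresholdMs) {
        safe.add(instant);
      }
    }
    return safe;
  }
}
```

This is only safe under the deployment constraints stated above (scheduling and execution in the same dedicated job, no inline clustering); otherwise a plan scheduled by a now-stopped writer would look identical to a failed one.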
nsivabalan commented on issue #17879: URL: https://github.com/apache/hudi/issues/17879#issuecomment-3766412418

I will focus on the problem statement before diving into the solution. Let me know if my understanding is right.

1. You are looking for support to roll back and nuke an existing clustering plan (already scheduled). This should not be a big ask, assuming you can enable the config for just one of the dedicated table service writers: whenever it detects a pending clustering plan in the timeline, it could roll back and nuke the plan. But chances are that another concurrent ingestion writer could hit a file-not-found issue, which needs to be tackled. That's why we proposed https://github.com/apache/hudi/pull/12856. Are you asking for the proposed RFC to be implemented? Can you help clarify, please?

2. Based on your requirements, the ask is not as simple as in 1. You could have multiple table service writers, where table service writer 1 and table service writer 2 could contend to perform clustering for the same table, depending on how the table services are orchestrated. Say we have a pending clustering plan in the timeline: unless we have a heartbeat, how would the other table service writer know whether a given clustering instant is being worked on or not? Does that mean you have already incorporated heartbeats for table services?

2.b. If my hunch is right, the heartbeats are enabled for both scheduling and execution of clustering. But typically scheduling and execution can be decoupled, so I am not sure how we would enable heartbeats for scheduling in such cases. Also, after scheduling, if the writer shuts down, the heartbeat could be seen as expired, right? But we would not want to roll back and nuke the clustering plan in this case.

Without going into the solution, can you help clarify the problem statement and requirements?
