Hi Yi - If I understand your use case correctly, ideally, you'd want the number of tasks in job_2 to be dynamic, and more specifically, the number of tasks in job_2 would directly depend on the result of job_1. Here are some ideas:
1. Using Task Framework user content store to enable dynamic submission of workflows. If you aren't familiar with Task Framework's user content store, it is a feature that allows the workflow/job/task logic to store key-value pairs, given the caveat that the lifecycle of the data corresponds to the lifecycle of the workflow/job/task (the API requires you to specify the scope). For example, have job_1 temporarily store the result and job_2's task logic could use the result to spawn single-tasked workflows.

2. If you are already modeling your databases as Helix generic resources, you could make your jobs targeted with the target resource name, partition states, and a command. For targeted jobs, tasks will be dynamically generated to target the said resources. However, if you're not already modeling your DBs as Helix resources, this might not be as straightforward as it sounds.

Hope this helps and perhaps others could chime in as well if they have any ideas,
Hunter

On Thu, Oct 11, 2018 at 11:26 AM Yi Chen <[email protected]> wrote:

> Hello,
>
> I am new to the Task Framework and need help understanding a few concepts.
> What is the best practice for jobs with dependencies, while the number of
> tasks also depend on the parent job?
>
> For example, the job_1 is to list all databases, and job_2 is to list all
> tables for all databases found from the result of job_1. The workflow
> examples I found either define the tasks statically, or starting a fixed
> number of tasks for a job.
>
> If I understand correctly, since I don't know exactly how many tasks I
> need in job_2, I should do my best guess and use a larger number as the
> number of partitions. For example, when I start the workflow, I can
> configure the job_2 to run 10 tasks, no matter how many databases exists.
> If there are 100 databases exists as the result of job_1, Helix Task
> Framework will somehow assign 5 databases to each task. Is this correct?
>
> Thanks,
> Yi
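P.S. In case it helps to picture idea 1, here is a rough sketch of the hand-off between the two jobs. It deliberately does not call Helix: InMemoryContentStore is a hypothetical stand-in for a workflow-scoped key-value store, and the key "job_1.databases", the "list_tables_" naming, and the method names are all made up for illustration. In a real setup job_2 would submit one single-tasked workflow per name instead of just returning the list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicFanOutSketch {

    // Hypothetical stand-in for a workflow-scoped key-value store.
    // This is NOT the Helix API; it only mimics the put/get idea.
    static final class InMemoryContentStore {
        private final Map<String, String> data = new HashMap<>();

        void put(String key, String value) {
            data.put(key, value);
        }

        String get(String key) {
            return data.get(key);
        }
    }

    // job_1: list the databases and store the result under an agreed key.
    static void runJob1(InMemoryContentStore store, List<String> databases) {
        store.put("job_1.databases", String.join(",", databases));
    }

    // job_2: read job_1's result and derive one unit of work per database.
    // The returned names stand in for the single-tasked workflows to submit.
    static List<String> runJob2(InMemoryContentStore store) {
        List<String> workflowNames = new ArrayList<>();
        String stored = store.get("job_1.databases");
        if (stored == null || stored.isEmpty()) {
            return workflowNames; // job_1 found nothing, so nothing to spawn
        }
        for (String db : stored.split(",")) {
            workflowNames.add("list_tables_" + db);
        }
        return workflowNames;
    }

    public static void main(String[] args) {
        InMemoryContentStore store = new InMemoryContentStore();
        runJob1(store, Arrays.asList("orders", "users", "inventory"));
        // One entry per database, however many job_1 discovered.
        System.out.println(runJob2(store));
    }
}
```

The point is only that the amount of work in job_2 is decided at run time from what job_1 stored, rather than being fixed when the workflow is configured.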
