Hi Hunter,

I was able to figure out why my task code stopped working after a few
minutes: I was using an embedded controller that was not configured
correctly, so the controller stopped working. I have a few follow-up
questions:

[1] My tasks are dynamic (I spin up one task for each query result from a
dependency). Does that mean I should not use the "recurrent workflow"? My
understanding is that a recurrent workflow is one long-running workflow
that repeats on the same data set.

[2] I am creating a one-off workflow at a regular interval, so each
workflow includes a different set of tasks depending on the upstream
dependency. I would like each workflow to be cleaned up before the next
run. What is the best practice for this case? I am using the following
three calls to wait for the workflow to terminate (an expanded sketch
follows the list):
* driver.pollForWorkflowState(workflowName, TIMED_OUT, FAILED, STOPPED,
ABORTED, COMPLETED);
* driver.waitToStop(workflowName, 5000);
* driver.delete(workflowName, false);
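
For reference, this is roughly how those three calls fit together on the
leader (a simplified sketch; "manager" is the connected HelixManager, the
method name is just a placeholder, and error handling is omitted):

import org.apache.helix.HelixManager;
import org.apache.helix.task.TaskDriver;
import org.apache.helix.task.TaskState;

void cleanUpWorkflow(HelixManager manager, String workflowName)
    throws Exception {
  TaskDriver driver = new TaskDriver(manager);
  // Block until the workflow reaches a terminal state.
  driver.pollForWorkflowState(workflowName, TaskState.TIMED_OUT,
      TaskState.FAILED, TaskState.STOPPED, TaskState.ABORTED,
      TaskState.COMPLETED);
  // Give it a moment to fully stop, then delete (not forced).
  driver.waitToStop(workflowName, 5000);
  driver.delete(workflowName, false);
}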

I am getting many error messages like the one below. I suspect this is
caused by a forced deletion that does not clean up completely?

2018-11-03 13:19:44 WARN  ZkClient:1164  Failed to delete path
> /jobserver-staging/PROPERTYSTORE/TaskRebalancer/ClusterMonitor_ListApplications!
> org.I0Itec.zkclient.exception.ZkException:
> org.apache.zookeeper.KeeperException$NotEmptyException: KeeperErrorCode =
> Directory not empty for
> /server-staging/PROPERTYSTORE/TaskRebalancer/ClusterMonitor_ListApplications


Thanks,
Yi

On Tue, Oct 16, 2018 at 12:01 AM Hunter Lee <[email protected]> wrote:

> Hi Yi,
>
> Are you calling pollForWorkflowState for TaskState.COMPLETED? Note that if
> you have set up a generic workflow, it is going to end up in the COMPLETED
> state, but if you have a JobQueue (a special kind of workflow that never
> dies), it will never reach the COMPLETED state. If you're not familiar with
> how these two differ, send a quick note and I can try to explain further.
> The HelixException you're getting is probably due to the polling API timing
> out.
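>
> To illustrate, polling a one-shot (generic) workflow could look roughly
> like the sketch below ("manager" here stands for your connected
> HelixManager):
>
> TaskDriver driver = new TaskDriver(manager);
> // A generic workflow terminates, so COMPLETED is reachable:
> driver.pollForWorkflowState(workflowName, TaskState.COMPLETED,
>     TaskState.FAILED, TaskState.TIMED_OUT);
> // A JobQueue never completes, so poll the job itself instead
> // (job names are namespaced as <queue>_<job>):
> // driver.pollForJobState(queueName, queueName + "_" + jobName,
> //     TaskState.COMPLETED);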
>
> It all depends on how you set up the workflow, but I get the feeling that
> your first task might not be finishing. I don't see any blatant
> misunderstanding of how to use the Task Framework in what you described,
> but if I were you, I'd check what state these workflows, jobs, and tasks
> are actually in.
>
> You could use ZooInspector to take a quick look at Helix internals.
> Connect to the ZooKeeper address and, in Helix's directory structure, look
> at /PROPERTYSTORE/TaskRebalancer/<your workflow or job name>. This is
> where the so-called workflow and job "contexts" are stored; contexts
> contain run-time details about workflows and jobs. Individual job states
> are listed in the WorkflowContext, and individual task states are in the
> mapFields of the JobContext.
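>
> If ZooInspector is inconvenient, here is a rough sketch of checking the
> same information programmatically (names are placeholders):
>
> TaskDriver driver = new TaskDriver(manager);
> WorkflowContext wCtx = driver.getWorkflowContext(workflowName);
> System.out.println("workflow state: " + wCtx.getWorkflowState());
> // Job names inside a workflow are namespaced as <workflow>_<job>:
> JobContext jCtx = driver.getJobContext(workflowName + "_" + jobName);
> for (int partition : jCtx.getPartitionSet()) {
>   System.out.println("task " + partition + ": "
>       + jCtx.getPartitionState(partition));
> }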
>
> Let me know if this is helpful,
> Hunter
>
> On Mon, Oct 15, 2018 at 4:51 PM Yi Chen <[email protected]> wrote:
>
>> Hi Hunter,
>>
>> Thanks for the response. I am trying to get a "Hello World" example
>> working to test the Task Framework, by following a few unit tests in the
>> source code. Here is my setup:
>>
>> [1] In the same Helix cluster, I have two nodes that participate in
>> leader election; one of them becomes the leader. The leader is responsible
>> for starting and stopping the Task Framework workflows. Every few minutes,
>> the leader contacts an external API to fetch a list of work items and
>> starts a workflow with one job, which contains one TaskConfig per work
>> item (a rough sketch follows this list).
>> [2] In the same Helix cluster, I added a few worker nodes whose sole
>> responsibility is to act as Helix Participants and run the distributed
>> tasks managed by the Task Framework.
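>>
>> In case it helps, this is roughly how the leader builds each workflow
>> (a simplified sketch; "WorkItemTask" is the command my participants
>> register, and the work-item id is passed through the task config):
>>
>> List<TaskConfig> taskConfigs = new ArrayList<>();
>> for (String workItem : workItems) {
>>   taskConfigs.add(new TaskConfig.Builder()
>>       .setTaskId("task_" + workItem)
>>       .setCommand("WorkItemTask")
>>       .addConfig("workItem", workItem)
>>       .build());
>> }
>> JobConfig.Builder job = new JobConfig.Builder().addTaskConfigs(taskConfigs);
>> Workflow workflow = new Workflow.Builder(workflowName)
>>     .addJob("work_item_job", job)
>>     .build();
>> new TaskDriver(manager).start(workflow);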
>>
>> For some reason, I can only get the example working if there is only one
>> task in the workflow. If I have two or more tasks in the job, or multiple
>> workflows that each contain one task, the workflows always get stuck in
>> the IN_PROGRESS state, causing a HelixException when I run
>> driver.pollForWorkflowState(workflowName). Do you know what might be
>> missing?
>>
>> Thanks,
>> Yi
>>
>> On Thu, Oct 11, 2018 at 3:26 PM Hunter Lee <[email protected]> wrote:
>>
>>> Hi Yi -
>>>
>>> If I understand your use case correctly, you ideally want the number of
>>> tasks in job_2 to be dynamic; more specifically, the number of tasks in
>>> job_2 would depend directly on the result of job_1. Here are some ideas:
>>>
>>> 1. Use the Task Framework user content store to enable dynamic
>>> submission of workflows.
>>> If you aren't familiar with the Task Framework's user content store, it
>>> is a feature that lets workflow/job/task logic store key-value pairs,
>>> with the caveat that the lifecycle of the data matches the lifecycle of
>>> the workflow/job/task (the API requires you to specify the scope).
>>> For example, job_1 could temporarily store its result, and job_2's task
>>> logic could use that result to spawn single-task workflows.
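>>>
>>> A very rough sketch of that idea (class and key names are made up, and
>>> fetchDatabaseNames() stands in for your own listing logic):
>>>
>>> public class ListDatabasesTask extends UserContentStore implements Task {
>>>   @Override
>>>   public TaskResult run() {
>>>     // Store job_1's result at workflow scope so later logic can read it.
>>>     putUserContent("databases", String.join(",", fetchDatabaseNames()),
>>>         Scope.WORKFLOW);
>>>     return new TaskResult(TaskResult.Status.COMPLETED, null);
>>>   }
>>>
>>>   @Override
>>>   public void cancel() {
>>>   }
>>>
>>>   private List<String> fetchDatabaseNames() {
>>>     return Arrays.asList("db_1", "db_2"); // placeholder
>>>   }
>>> }
>>>
>>> // Later, in another task that also extends UserContentStore:
>>> // String databases = getUserContent("databases", Scope.WORKFLOW);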
>>>
>>> 2. If you are already modeling your databases as Helix generic
>>> resources, you could make your jobs targeted by specifying a target
>>> resource name, target partition states, and a command. For targeted
>>> jobs, tasks are dynamically generated to target those resources.
>>> However, if you're not already modeling your DBs as Helix resources,
>>> this might not be as straightforward as it sounds.
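>>>
>>> For example (a sketch, assuming "MyDB" is already a Helix resource in a
>>> MASTER/SLAVE state model and "ListTables" is a registered task command):
>>>
>>> JobConfig.Builder targetedJob = new JobConfig.Builder()
>>>     .setTargetResource("MyDB")
>>>     .setTargetPartitionStates(Collections.singleton("MASTER"))
>>>     .setCommand("ListTables");
>>> // One task is generated per matching partition of MyDB.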
>>>
>>> Hope this helps and perhaps others could chime in as well if they have
>>> any ideas,
>>> Hunter
>>>
>>> On Thu, Oct 11, 2018 at 11:26 AM Yi Chen <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am new to the Task Framework and need help understanding a few
>>>> concepts. What is the best practice for jobs with dependencies, where
>>>> the number of tasks also depends on the parent job?
>>>>
>>>> For example, job_1 lists all databases, and job_2 lists all tables in
>>>> each database found by job_1. The workflow examples I found either
>>>> define the tasks statically or start a fixed number of tasks for a job.
>>>>
>>>> If I understand correctly, since I don't know exactly how many tasks I
>>>> need in job_2, I should make my best guess and use a larger number as
>>>> the number of partitions. For example, when I start the workflow, I can
>>>> configure job_2 to run 10 tasks no matter how many databases exist. If
>>>> job_1 finds 100 databases, the Helix Task Framework will somehow assign
>>>> 10 databases to each task. Is this correct?
>>>>
>>>> Thanks,
>>>> Yi
>>>>
>>>
