Ray Yang created GOBBLIN-336:
Summary: Gobblin Cluster Job Isolation
Key: GOBBLIN-336
URL: https://issues.apache.org/jira/browse/GOBBLIN-336
Project: Apache Gobblin
Issue Type: Improvement
Reporter: Ray Yang
Gobblin cluster runs Gobblin jobs. Each cluster worker host runs jobs in a
thread pool in a single JVM. The thread pool is reused for subsequent jobs
after earlier jobs finish.
Gobblin cluster recently ran into issues with resource leakage. The cluster
would fail all job executions when certain resources, such as threads, were
exhausted. To recover, the whole cluster had to be restarted and the jobs had
to be retried. As the number of jobs executed increases, such errors are
expected to happen more frequently. We have identified the causes, and the
fixes have been verified. However, there are concerns that similar, as yet
unknown bugs may show up later and bring the whole cluster down.
In general, any bug in one job’s code may affect the executions of another job
since they run in the same JVM. It’s also possible that a bug will only be
triggered by certain input data which is specific to a subset of jobs.
The cluster will be more robust if each job execution is better isolated from
other jobs.
We expect jobs to become more diverse as more use cases are onboarded, so the
need for job isolation will grow over time. Job isolation may eventually be
required for security reasons too.