Hi,
I turned on container reuse and increased the time that containers linger
after a task completes (tez.am.container.session.delay-allocation-millis),
but I'm still having an issue. Sometimes the Processor I wrote fails
because of application logic in one DAG but not in the next. A trivial
example:
    class MyProcessor implements LogicalIOProcessor {
      // Other non-application logic code
      public void run(...) {
        if (new Random().nextBoolean()) {
          throw new FooBarBazException();
        }
      }
    }
In this case I don't want the task JVM to be deallocated, because the
failure was caused by application logic. Otherwise, the next time I start
a DAG I pay the long task JVM startup delay again.
I see the following code in the source (TaskScheduler#deallocateTask(...))
that I think is the cause of this:
    if (!taskSucceeded || !shouldReuseContainers) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Releasing container, containerId=" + container.getId()
            + ", taskSucceeded=" + taskSucceeded
            + ", reuseContainersFlag=" + shouldReuseContainers);
      }
      releaseContainer(container.getId());
    }
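To spell out how I read that condition, here is a small standalone sketch.
It is not Tez code: the class and method names (ReleaseDecisionSketch,
shouldRelease) are made up for illustration, and it only mirrors the boolean
check quoted above.

```java
// Hypothetical stand-in for the check quoted above; not part of Tez.
class ReleaseDecisionSketch {

  // Mirrors: if (!taskSucceeded || !shouldReuseContainers) { release }
  static boolean shouldRelease(boolean taskSucceeded,
                               boolean shouldReuseContainers) {
    return !taskSucceeded || !shouldReuseContainers;
  }

  public static void main(String[] args) {
    // Reuse enabled, task succeeded: the container is kept.
    System.out.println("succeeded, reuse on:  release="
        + shouldRelease(true, true));
    // Reuse enabled, task failed: the container is released anyway.
    System.out.println("failed, reuse on:     release="
        + shouldRelease(false, true));
    // Reuse disabled: the container is released regardless of outcome.
    System.out.println("succeeded, reuse off: release="
        + shouldRelease(true, false));
  }
}
```

If I am reading it right, the second case is the one I am hitting: a failed
task releases its container even though reuse is enabled.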
Is this something that can be fixed in master? Or is there a
workaround/conf I can set to get this working?
Thanks,
Thad