This is a known deficiency in the current client API: the implementation tends to retry indefinitely, with only a short pause between attempts.

This tends to work well when the services are functioning or failing "normally". If your DNS failure is transient, you should recover automatically, but if it's an extended failure, the client will keep retrying and appear to hang, which is what you're observing.

It's hard to draw the line between "expected" or recoverable failures and failures that you want to propagate back to your client. I'm not sure whether this is planned to be addressed in the new client API or not (https://issues.apache.org/jira/browse/ACCUMULO-2589).
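
To make the "retry forever" versus "fail after a while" distinction concrete, here is a minimal sketch of a bounded retry in plain Java. It is not Accumulo code: BoundedRetry, retryUntil, and the host name tserver.invalid are made-up names for illustration, and the deadline and pause values are arbitrary. The idea is simply that a failure such as UnknownHostException is retried only until a caller-supplied deadline passes, after which the last failure is propagated.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the Accumulo API.
final class BoundedRetry {

    // Calls op until it succeeds or the deadline has passed.
    // Once the deadline has passed, the most recent failure is rethrown.
    static <T> T retryUntil(Callable<T> op, long timeout, TimeUnit unit, long pauseMillis)
            throws Exception {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (true) {
            try {
                return op.call();
            } catch (Exception e) {
                // Out of time: stop retrying and let the caller see the failure.
                if (System.nanoTime() - deadline >= 0) {
                    throw e;
                }
                // Otherwise treat the failure as transient and try again shortly.
                Thread.sleep(pauseMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // "tserver.invalid" is a placeholder host name for this sketch.
            InetAddress addr = retryUntil(
                    () -> InetAddress.getByName("tserver.invalid"),
                    2, TimeUnit.SECONDS, 200);
            System.out.println("resolved: " + addr);
        } catch (UnknownHostException e) {
            System.out.println("gave up after the deadline: " + e.getMessage());
        }
    }
}
```

In Accumulo itself the closest existing knobs are, I believe, the timeouts mentioned in the quoted message below: ScannerBase.setTimeout(long, TimeUnit) for scans and BatchWriterConfig.setTimeout(long, TimeUnit) for writes. Picking the deadline is the same judgment call as above: too short and a transient DNS blip becomes a client error, too long and the application still looks hung.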

Ariel Valentin wrote:
We have a very peculiar situation, where a DNS failure is causing our
application to hang.

Based on the trace debugging logs it appears that the ThriftScanner
encounters a TTransportException, which was caused by an
UnknownHostException. It seems to then retry a few seconds later.

http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.accumulo/accumulo-core/1.6.0-cdh4.6.0/org/apache/accumulo/core/client/impl/ThriftScanner.java/#124

https://gist.github.com/arielvalentin/794415d1744e52984d0d

After tracing the code a bit I realized that we could mitigate the
"hanging" by setting a timeout on our scans/writes however I would
prefer that the client would fail faster if it could not resolve the
hostnames of the TServers it found in zookeeper.

Thoughts? Concerns? Opinions?

Ariel Valentin
e-mail: [email protected] <mailto:[email protected]>
website: http://blog.arielvalentin.com
skype: ariel.s.valentin
twitter: arielvalentin
linkedin: http://www.linkedin.com/profile/view?id=8996534
---------------------------------------
*simplicity *communication
*feedback *courage *respect
