Interesting find... It looks like that bit was added recently
(https://reviews.apache.org/r/17644/diff/3/), so it was not part of Giraph
1.0.0 as far as I can tell.
Also, if anyone cares, a clunky (Ubuntu) workaround I'm using is:

    kill $(ps aux | grep "[j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/" | awk '{print $2}')
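For anyone curious how it works, here is a commented version of the same
one-liner (the 12-digit/4-digit pattern matches classic Hadoop job IDs like
job_201403171234_0001, so adjust it if your IDs look different):

    # Find leftover task JVMs by the jobcache path on their command line.
    # The [j] keeps grep from matching its own process in the ps output.
    kill $(ps aux \
      | grep "[j]obcache/job_[0-9]\{12\}_[0-9]\{4\}/" \
      | awk '{print $2}')    # field 2 of ps aux is the PID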
Thanks,
Young
On Mon, Mar 17, 2014 at 6:10 PM, Craig Muchinsky <[email protected]> wrote:
> I just noticed a similar problem myself. I did a thread dump and found
> similar netty client threads lingering. After poking around the source a
> bit, I'm wondering if the problem is related to this bit of code I found in
> the NettyClient.stop() method:
>
> workerGroup.shutdownGracefully();
> ProgressableUtils.awaitTerminationFuture(executionGroup, context);
> if (executionGroup != null) {
>   executionGroup.shutdownGracefully();
>   ProgressableUtils.awaitTerminationFuture(executionGroup, context);
> }
>
> Notice that the first await termination call seems to be waiting on the
> executionGroup instead of the workerGroup...
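> If that's the bug, the intended code was presumably something like this
> (my guess, not a tested patch against the Giraph source):
>
>   workerGroup.shutdownGracefully();
>   ProgressableUtils.awaitTerminationFuture(workerGroup, context);
>   if (executionGroup != null) {
>     executionGroup.shutdownGracefully();
>     ProgressableUtils.awaitTerminationFuture(executionGroup, context);
>   }
>
> As written, stop() can return before workerGroup has actually terminated,
> which would fit the lingering netty client threads in the dumps.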
>
> Craig M.
>
>
>
> From: Young Han <[email protected]>
> To: [email protected]
> Date: 03/17/2014 03:25 PM
> Subject: Re: Java Process Memory Leak
> ------------------------------
>
>
>
> Oh, I see. I ran jstack both on a cluster of machines and on a single
> machine... I'm not quite sure how to interpret the output. My best guess
> is that there might be a deadlock; there's just a bunch of Netty threads
> waiting. Links to the jstack dumps:
>
> http://pastebin.com/0cLuaF07 (PageRank, single worker, amazon0505 graph
> from SNAP)
> http://pastebin.com/MNEUELui (MST, from one of the 64 workers, com-orkut
> graph from SNAP)
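> (For what it's worth, a JVM only exits once all of its non-daemon threads
> have terminated, so idle non-daemon Netty threads would be enough to keep
> the process alive even without a deadlock. A quick way to pull them out of
> a dump, assuming it's saved as dump.txt:
>
>   grep -B1 "Thread.State: WAITING" dump.txt | grep -i netty
>
> which prints the header line of each waiting thread whose name mentions
> netty.)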
>
> Any idea what's happening? Or anything in particular I should look for
> next?
>
> Thanks,
> Young
>
>
> On Mon, Mar 17, 2014 at 12:19 PM, Avery Ching <[email protected]>
> wrote:
> Hi Young,
>
> Our Hadoop instance (Corona) kills processes after they finish executing,
> so we don't see this. You might want to do a jstack to see where it's hung
> and figure out the issue.
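> For example (the main class name shown by jps varies by Hadoop version; on
> classic MapReduce the task JVMs usually show up as "Child"):
>
>   jps -l                        # list JVM pids with their main classes
>   jstack -l <pid> > dump.txt    # thread dump; -l adds lock/synchronizer info
>
> A dump from a hung process will usually show what the surviving non-daemon
> threads are blocked on.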
>
> Thanks
>
> Avery
>
>
> On 3/17/14, 7:56 AM, Young Han wrote:
> Hi all,
>
> With Giraph 1.0.0, I've noticed an issue where the Java process
> corresponding to the job lingers indefinitely even after the job completes
> (successfully). The process consumes memory but not CPU time. This happens
> both on a single machine and on clusters of machines (in which case every
> worker has the issue). The only way I know of to fix this is to kill the
> Java process manually; restarting or stopping Hadoop does not help.
>
> Is this some known bug or a configuration issue on my end?
>
> Thanks,
> Young
>