Won't this just postpone the pain?

On Thursday, June 7, 2012, David Garcia wrote:
> Based upon what you have mentioned, I think you are getting heap errors
> because every vertex in your graph will be loaded into memory prior to
> superstep one. So if you have a large graph, with lots of state, you
> probably have memory issues from the very beginning. A simple way to
> mitigate the problem is to load only the vertices that you need and then
> add vertices as your computation progresses. This will prevent the
> entire graph from occupying memory.
>
> ----- Reply message -----
> From: "Avery Ching" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: Resources or advice on minimising memory usage in Giraph/Hadoop code ?
> Date: Wed, Jun 6, 2012 10:33 pm
>
> No article or book, but here are a few tips.
>
> 1) Use aggregators! This can drastically reduce the amount of memory
> used by combining messages on the server side.
> 2) -Dmapred.child.java.opts="-Xss128k" or some other value (should
> affect the RPC threads or netty threads)
> 3) You'll want to minimize the state of every vertex as much as
> possible, perhaps creating a custom vertex.
>
> Avery
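
For concreteness, a minimal sketch of tip (1) combined with David's suggestion of adding vertices on demand. It assumes the Giraph 1.x-style API (BasicComputation / DefaultMasterCompute); the 0.x API current at the time of this thread differs, and the class names, the aggregator name and the toy "sum" logic below are invented for illustration only.

    import java.io.IOException;

    import org.apache.giraph.aggregators.LongSumAggregator;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.giraph.master.DefaultMasterCompute;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    public class LeanComputation extends
        BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      /** Hypothetical aggregator name; registered in the master compute below. */
      public static final String MSG_COUNT_AGG = "msg.count";

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        long received = 0;
        double sum = 0;
        for (DoubleWritable m : messages) {
          sum += m.get();
          received++;
        }

        // Tip (1): global statistics go into an aggregator instead of being
        // kept (or re-broadcast) as per-vertex state.
        aggregate(MSG_COUNT_AGG, new LongWritable(received));

        // David's suggestion: vertices that are only needed later can be
        // created on demand instead of being loaded before superstep 0.
        // (The id arithmetic here is a placeholder for real application logic.)
        if (getSuperstep() > 0 && sum > 0) {
          addVertexRequest(new LongWritable(vertex.getId().get() + 1),
              new DoubleWritable(sum));
        }

        vertex.setValue(new DoubleWritable(sum));
        vertex.voteToHalt();
      }

      /** Master-side registration of the aggregator used above. */
      public static class LeanMasterCompute extends DefaultMasterCompute {
        @Override
        public void initialize() throws InstantiationException, IllegalAccessException {
          registerAggregator(MSG_COUNT_AGG, LongSumAggregator.class);
        }
      }
    }
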
> On 6/5/12 7:38 PM, Benjamin Heitmann wrote:
> > Hello,
> >
> > can somebody recommend a web page, article or book on minimising the
> > memory usage of Giraph/Hadoop code ?
> > I am looking for non-obvious advice on what *not* to do, and for best
> > practices on what to do inside of Hadoop...
> >
> > E.g. is it preferable to use Java Strings or Hadoop Text Writables ?
> > Should all strings be externalised ?
> >
> > Currently, I am running a Giraph job with 10 workers. Each worker has
> > a maximum heap of 7 GB (-Xmx7g).
> > Concurrent garbage collection is enabled. The machine has 24 cores
> > and 96 GB of memory.
> > The job currently uses a max of around 50 GB, so there is free memory
> > available outside of Java.
> >
> > The graph itself has ~2 million vertices and ~4 million edges, which
> > is not really "big data".
> >
> > However, before starting superstep 1, I get heap space errors.
> > Previous versions of my algorithm were simpler, but they also ran
> > into heap space errors when the data was around one order of
> > magnitude bigger.
> >
> > My suspicion is that the amount of state which my vertices have, and
> > the amount of messages which I am generating, exceed the standard use
> > case of a PageRank algorithm by far.
> >
> > To list a few of the reasons why I need a lot of state:
> >
> > * I need to execute multiple runs of the same algorithm in parallel.
> > Loading this specific graph takes about 3 minutes, running the
> > algorithm once takes about 10 seconds or so, but I have around 600
> > users in that graph. And this is just a small graph; the whole
> > algorithm is intended to be run for thousands of users.
> > (... "big data"...)
> >
> > * The identities of the edges and vertices are not based on numbers
> > but on strings. All edges and all vertices have a URI associated with
> > them. The graph represents RDF data from different sources, such as
> > DBpedia. In addition, most of the vertices have one or multiple types
> > associated with them, and each type is again represented by a URI.
> > These types are essential to the logic of the algorithm.
> > I guess it would be possible to externalise all of those strings, but
> > it adds a layer of complexity which I had previously hoped to avoid.
> >
> > * As Giraph does not currently provide a central coordination point
> > for the processing of the graph, I need to send a lot of messages
> > between vertices in order to coordinate the algorithm.
> >
> > * Giraph does not allow multiple Java classes to be used for
> > different vertices in the same graph. However, different vertices
> > have different roles in my algorithm, and each role has a different
> > set of states in which it can be, due to the missing global
> > coordination point.
> >
> > * Taken together, the lack of a central coordination point and the
> > inability to have different Java classes as part of the same graph
> > make the whole algorithm more similar to a network protocol than to a
> > graph algorithm. Thus I need a lot of messages and a lot of state.
> >
> > If anybody has some good suggestions on how I should proceed, I would
> > be very interested in hearing them.
> >
> > If somebody wants to take a look at my code, then I can currently
> > provide you with that code in a non-public way.
> >
> > sincerely, Benjamin Heitmann.

--
Claudio Martella
[email protected]
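
PS, regarding the question above about Java Strings vs. Hadoop Text and about externalising the URIs: one option, sketched below under invented names, is to dictionary-encode the URIs in a single preprocessing pass over the RDF input before the Giraph job runs, so that vertex ids and edge targets become plain LongWritables and no URI string is ever held in vertex state. The reverse map is written out once to translate results back afterwards.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;

    /**
     * Hypothetical preprocessing helper: assigns a compact numeric id to every
     * URI while the RDF input is rewritten, so the Giraph job only ever sees
     * LongWritable ids instead of URI strings. Meant to run once, in a single
     * process, before the job; the reverse mapping is kept to translate
     * results back afterwards.
     */
    public class UriDictionary {
      private final Map<String, Long> uriToId = new HashMap<String, Long>();

      /** Returns the id for a URI, assigning the next free one on first sight. */
      public LongWritable encode(String uri) {
        Long id = uriToId.get(uri);
        if (id == null) {
          id = Long.valueOf(uriToId.size());
          uriToId.put(uri, id);
        }
        return new LongWritable(id.longValue());
      }

      /** Number of distinct URIs seen so far. */
      public int size() {
        return uriToId.size();
      }
    }

At ~2 million vertices the dictionary itself should fit comfortably in memory during that preprocessing step.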

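Similarly, for tip (3) and the point about different vertex roles being forced into a single Java class: one workaround is a single, deliberately small vertex value that encodes the role, the current protocol phase and the (pre-indexed) RDF types as primitives rather than as strings or separate classes. A hypothetical sketch, with the field layout invented for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    /**
     * Hypothetical compact vertex value: one class for every vertex, with the
     * per-role behaviour selected by a small code instead of separate Java
     * classes or URI strings. Roles and phases are single bytes; RDF types are
     * a bit set over a fixed, pre-computed type dictionary (up to 64 types).
     */
    public class CompactVertexValue implements Writable {
      private byte role;      // e.g. 0 = resource, 1 = type node, 2 = user
      private byte phase;     // current state of this vertex in the "protocol"
      private long typeBits;  // membership in up to 64 pre-indexed RDF types

      public CompactVertexValue() { }

      public CompactVertexValue(byte role, byte phase, long typeBits) {
        this.role = role;
        this.phase = phase;
        this.typeBits = typeBits;
      }

      public byte getRole() { return role; }
      public byte getPhase() { return phase; }
      public void setPhase(byte phase) { this.phase = phase; }

      /** True if this vertex carries the RDF type with the given dictionary index. */
      public boolean hasType(int typeIndex) {
        return (typeBits & (1L << typeIndex)) != 0;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeByte(role);
        out.writeByte(phase);
        out.writeLong(typeBits);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        role = in.readByte();
        phase = in.readByte();
        typeBits = in.readLong();
      }
    }

Two bytes plus one long per vertex amounts to a few bytes of state where several URI strings would otherwise be kept.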