Won't this just postpone the pain?

On Thursday, June 7, 2012, David Garcia wrote:
> Based upon what you have mentioned, I think you are getting heap errors
> because every vertex in your graph will be loaded into memory prior to
> superstep one. So if you have a large graph, with lots of state, you
> probably have memory issues from the very beginning. A simple way to
> mitigate the problem is to load only the vertices that you need and then
> add vertices as your computation progresses. This will prevent the
> entire graph from occupying memory.
>
> ----- Reply message -----
> From: "Avery Ching" <[email protected]>
> To: "[email protected]" <[email protected]>
> Subject: Resources or advice on minimising memory usage in Giraph/Hadoop code ?
> Date: Wed, Jun 6, 2012 10:33 pm
>
> No article or book, but here are a few tips.
>
> 1) Use aggregators! This can drastically reduce the amount of memory
> used by combining messages on the server side.
> 2) -Dmapred.child.java.opts="-Xss128k" or some other value (should
> affect the RPC threads or netty threads)
> 3) You'll want to minimize the state of every vertex as much as
> possible, perhaps creating a custom vertex.
>
> Avery
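
For concreteness, a minimal sketch of tip (1) combined with David's suggestion of adding vertices on demand. It assumes the Giraph 1.x-style API (BasicComputation / DefaultMasterCompute); the 0.x API current at the time of this thread differs, and the class names, the aggregator name and the toy "sum" logic below are invented for illustration only.

    import java.io.IOException;

    import org.apache.giraph.aggregators.LongSumAggregator;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.giraph.master.DefaultMasterCompute;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    public class LeanComputation extends
        BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      /** Hypothetical aggregator name; registered in the master compute below. */
      public static final String MSG_COUNT_AGG = "msg.count";

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
          Iterable<DoubleWritable> messages) throws IOException {
        long received = 0;
        double sum = 0;
        for (DoubleWritable m : messages) {
          sum += m.get();
          received++;
        }

        // Tip (1): global statistics go into an aggregator instead of being
        // kept (or re-broadcast) as per-vertex state.
        aggregate(MSG_COUNT_AGG, new LongWritable(received));

        // David's suggestion: vertices that are only needed later can be
        // created on demand instead of being loaded before superstep 0.
        // (The id arithmetic here is a placeholder for real application logic.)
        if (getSuperstep() > 0 && sum > 0) {
          addVertexRequest(new LongWritable(vertex.getId().get() + 1),
              new DoubleWritable(sum));
        }

        vertex.setValue(new DoubleWritable(sum));
        vertex.voteToHalt();
      }

      /** Master-side registration of the aggregator used above. */
      public static class LeanMasterCompute extends DefaultMasterCompute {
        @Override
        public void initialize() throws InstantiationException, IllegalAccessException {
          registerAggregator(MSG_COUNT_AGG, LongSumAggregator.class);
        }
      }
    }
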
> On 6/5/12 7:38 PM, Benjamin Heitmann wrote:
> > Hello,
> >
> > can somebody recommend a web page, article or book on minimising the
> > memory usage of Giraph/Hadoop code ?
> > I am looking for non-obvious advice on what *not* to do, and for best
> > practices on what to do inside of Hadoop...
> >
> > E.g. is it preferable to use Java Strings or Hadoop Text Writables ?
> > Should all strings be externalised ?
> >
> > Currently, I am running a Giraph job with 10 workers. Each worker has
> > a maximum heap of 7 GB (-Xmx7g).
> > Concurrent garbage collection is enabled. The machine has 24 cores
> > and 96 GB of memory.
> > The job currently uses a max of around 50 GB, so there is free memory
> > available outside of Java.
> >
> > The graph itself has ~2 million vertices and ~4 million edges, which
> > is not really "big data".
> >
> > However, before starting superstep 1, I get heap space errors.
> > Previous versions of my algorithm were simpler, but they also ran
> > into heap space errors when the data was around one order of
> > magnitude bigger.
> >
> > My suspicion is that the amount of state which my vertices have, and
> > the amount of messages which I am generating, exceed the standard use
> > case of a PageRank algorithm by far.
> >
> > To list a few of the reasons why I need a lot of state:
> >
> > * I need to execute multiple runs of the same algorithm in parallel.
> > Loading this specific graph takes about 3 minutes, running the
> > algorithm once takes about 10 seconds or so, but I have around 600
> > users in that graph. And this is just a small graph; the whole
> > algorithm is intended to be run for thousands of users.
> > (... "big data"...)
> >
> > * The identities of the edges and vertices are not based on numbers
> > but on strings. All edges and all vertices have a URI associated with
> > them. The graph represents RDF data from different sources, such as
> > DBpedia. In addition, most of the vertices have one or multiple types
> > associated with them, and each type is again represented by a URI.
> > These types are essential to the logic of the algorithm.
> > I guess it would be possible to externalise all of those strings, but
> > it adds a layer of complexity which I had previously hoped to avoid.
> >
> > * As Giraph does not currently provide a central coordination point
> > for the processing of the graph, I need to send a lot of messages
> > between vertices in order to coordinate the algorithm.
> >
> > * Giraph does not allow multiple Java classes to be used for
> > different vertices in the same graph. However, different vertices
> > have different roles in my algorithm, and each role has a different
> > set of states in which it can be, due to the missing global
> > coordination point.
> >
> > * Taken together, the lack of a central coordination point and the
> > inability to have different Java classes as part of the same graph
> > make the whole algorithm more similar to a network protocol than to a
> > graph algorithm. Thus I need a lot of messages and a lot of state.
> >
> > If anybody has some good suggestions on how I should proceed, I would
> > be very interested in hearing them.
> >
> > If somebody wants to take a look at my code, then I can currently
> > provide you with that code in a non-public way.
> >
> > sincerely, Benjamin Heitmann.

--
Claudio Martella
[email protected]
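
PS, regarding the question above about Java Strings vs. Hadoop Text and about externalising the URIs: one option, sketched below under invented names, is to dictionary-encode the URIs in a single preprocessing pass over the RDF input before the Giraph job runs, so that vertex ids and edge targets become plain LongWritables and no URI string is ever held in vertex state. The reverse map is written out once to translate results back afterwards.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;

    /**
     * Hypothetical preprocessing helper: assigns a compact numeric id to every
     * URI while the RDF input is rewritten, so the Giraph job only ever sees
     * LongWritable ids instead of URI strings. Meant to run once, in a single
     * process, before the job; the reverse mapping is kept to translate
     * results back afterwards.
     */
    public class UriDictionary {
      private final Map<String, Long> uriToId = new HashMap<String, Long>();

      /** Returns the id for a URI, assigning the next free one on first sight. */
      public LongWritable encode(String uri) {
        Long id = uriToId.get(uri);
        if (id == null) {
          id = Long.valueOf(uriToId.size());
          uriToId.put(uri, id);
        }
        return new LongWritable(id.longValue());
      }

      /** Number of distinct URIs seen so far. */
      public int size() {
        return uriToId.size();
      }
    }

At ~2 million vertices the dictionary itself should fit comfortably in memory during that preprocessing step.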

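Similarly, for tip (3) and the point about different vertex roles being forced into a single Java class: one workaround is a single, deliberately small vertex value that encodes the role, the current protocol phase and the (pre-indexed) RDF types as primitives rather than as strings or separate classes. A hypothetical sketch, with the field layout invented for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    /**
     * Hypothetical compact vertex value: one class for every vertex, with the
     * per-role behaviour selected by a small code instead of separate Java
     * classes or URI strings. Roles and phases are single bytes; RDF types are
     * a bit set over a fixed, pre-computed type dictionary (up to 64 types).
     */
    public class CompactVertexValue implements Writable {
      private byte role;      // e.g. 0 = resource, 1 = type node, 2 = user
      private byte phase;     // current state of this vertex in the "protocol"
      private long typeBits;  // membership in up to 64 pre-indexed RDF types

      public CompactVertexValue() { }

      public CompactVertexValue(byte role, byte phase, long typeBits) {
        this.role = role;
        this.phase = phase;
        this.typeBits = typeBits;
      }

      public byte getRole() { return role; }
      public byte getPhase() { return phase; }
      public void setPhase(byte phase) { this.phase = phase; }

      /** True if this vertex carries the RDF type with the given dictionary index. */
      public boolean hasType(int typeIndex) {
        return (typeBits & (1L << typeIndex)) != 0;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeByte(role);
        out.writeByte(phase);
        out.writeLong(typeBits);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        role = in.readByte();
        phase = in.readByte();
        typeBits = in.readLong();
      }
    }

Two bytes plus one long per vertex amounts to a few bytes of state where several URI strings would otherwise be kept.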