Hello everybody,


First of all, thank you for all your work on Giraph.

I'm a student writing my bachelor's thesis using Giraph.

I have already implemented an algorithm that isn't completely trivial; with
my new task, however, I'm running into problems.


I implemented my first algorithm as a subclass of MasterCompute, where I
register aggregators etc., then do a switch on the superstep and call
setComputation() with the appropriate AbstractComputation subclass I wrote.
(I hope this is how you're supposed to do it.)
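
For reference, here is a trimmed-down sketch of what that looks like (the
two computation classes are placeholder stubs standing in for my real ones):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.graph.AbstractComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.giraph.master.MasterCompute;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class Algorithm1MasterCompute extends MasterCompute {
  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    // Register aggregators etc. here.
    registerAggregator("counter", LongSumAggregator.class);
  }

  @Override
  public void compute() {
    // Switch on the superstep and pick the matching computation.
    switch ((int) getSuperstep()) {
      case 0:
        setComputation(InitComputation.class);
        break;
      default:
        setComputation(MainComputation.class);
        break;
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException { }

  @Override
  public void write(DataOutput out) throws IOException { }

  // Placeholder stubs for my actual computation subclasses.
  public static class InitComputation extends AbstractComputation<
      LongWritable, DoubleWritable, FloatWritable, DoubleWritable, DoubleWritable> {
    @Override
    public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
        Iterable<DoubleWritable> messages) throws IOException {
      vertex.voteToHalt();
    }
  }

  public static class MainComputation extends InitComputation { }
}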

Now I want to implement a new algorithm that calls my first algorithm as a
subroutine: while the halting condition is not fulfilled, I first partition
the graph using algorithm 1, then continue to work on it with algorithm 2.
Is there a way to do this without duplicating a lot of code from the
"standalone" MasterCompute class of algorithm 1? An additional problem that
arises with this is that the vertex and edge value classes would have to be
exactly the same for both algorithms. I tried to work around this by
defining a Writable interface that declares all the methods my algorithms
need, which would make the actual value classes exchangeable. However,
Giraph uses reflection to create a new instance of exactly the type the
computation uses (here, an interface), which leads to an error. Is there any
good way to do this, or are Giraph jobs not that flexible?
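
To make the intent concrete, here is roughly the structure I would like to
end up with (all names are my own placeholders; the idea is to extract
algorithm 1's master-side logic into a helper that both the standalone
master and the combined master can delegate to):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.giraph.master.MasterCompute;

public class CombinedMasterCompute extends MasterCompute {

  // Shared helper holding algorithm 1's master-side logic (placeholder).
  public static class Phase1Logic {
    // Register algorithm 1's aggregators on the given master.
    public void registerAggregators(MasterCompute master)
        throws InstantiationException, IllegalAccessException {
      // master.registerAggregator("...", SomeAggregator.class);
    }

    // Run one superstep of algorithm 1 (its own step counter, its own
    // setComputation() calls); return true once it has finished.
    public boolean computeStep(MasterCompute master) {
      return true;  // placeholder
    }
  }

  private final Phase1Logic phase1 = new Phase1Logic();
  private boolean inPhase1 = true;

  @Override
  public void initialize() throws InstantiationException, IllegalAccessException {
    phase1.registerAggregators(this);  // same registrations as the standalone master
  }

  @Override
  public void compute() {
    if (inPhase1) {
      // Delegate the partitioning supersteps to algorithm 1's shared logic.
      if (phase1.computeStep(this)) {
        inPhase1 = false;
      }
    } else {
      // Algorithm 2's own supersteps; may set inPhase1 = true again to
      // re-partition until the halting condition is fulfilled.
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    inPhase1 = in.readBoolean();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeBoolean(inPhase1);
  }
}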


In a variation of the first algorithm, I need to register n aggregators (n =
number of vertices). However, the documentation reads as if I have to
register aggregators in MasterCompute.initialize() (and nowhere else), and
at that point getTotalNumVertices() does not work yet. I am aware that this
is a very costly operation (I do not use all of those aggregators, but I do
not know beforehand which of them I will need), but currently the only
workaround I can think of is reading the total number of vertices with
Configuration.getInt(), after writing it into the job configuration
beforehand.
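
Concretely, the workaround would look something like this inside
initialize() of a master like the one above ("my.app.numVertices" is a
made-up key that I would have to fill in before submitting the job, and
DoubleMaxAggregator just stands in for my actual aggregator class):

import org.apache.giraph.aggregators.DoubleMaxAggregator;

// ...

@Override
public void initialize() throws InstantiationException, IllegalAccessException {
  // getTotalNumVertices() is not usable yet at this point, so read the
  // vertex count from the job configuration instead.
  int n = getConf().getInt("my.app.numVertices", -1);
  for (int i = 0; i < n; i++) {
    registerAggregator("agg_" + i, DoubleMaxAggregator.class);
  }
}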

Also, a question on the config files: I understand that there are several of
them that override each other, and that I should not change e.g.
core-default.xml or core-site.xml, since that would affect my complete
Hadoop installation. I also read somewhere that it's possible to write a
config file for just one job (which is what I need), but I never found out
what that file should be called or where I would have to place it.
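
For illustration, all I really need from such a file is a handful of values
that end up readable inside the job, like this (the key name is made up; I
believe commands run through Hadoop's ToolRunner also accept a generic
"-conf <file.xml>" option, which might be exactly this mechanism, but I have
not been able to verify that with Giraph):

import org.apache.hadoop.conf.Configuration;

public class PerJobConfigExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // What the per-job file (or a command-line switch) would supply:
    conf.setInt("my.app.numVertices", 42);
    // What my MasterCompute would later read via getConf():
    System.out.println(conf.getInt("my.app.numVertices", -1));
  }
}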


My biggest problem right now is debugging, though: is there an easy way to
test Giraph code on small sample graphs? Right now, to test my
implementations, I have to package my code with Maven, copy the long command
into the terminal to run the Giraph job (changing the output folder, since
it has to be different each time), wait a few minutes for the job to
complete, open the web GUI, and click through a few pages there until I see
my debug statements; if the job completed, I have to scan through a text
file via the console. Compared to what I was used to (one click in Eclipse
and almost instantly seeing the output on the console), this is very
annoying, especially since I make dumb small mistakes, like switching the if
and else blocks, more often than I'd like to, and then have to go through
the whole process again each time.
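
For concreteness, the kind of test I am hoping for would look something like
this. I spotted InternalVertexRunner in the Giraph sources, and the
following is my unverified guess at how to use it, with the bundled
shortest-paths example standing in for my own computation and master
classes:

import org.apache.giraph.conf.GiraphConfiguration;
import org.apache.giraph.examples.SimpleShortestPathsComputation;
import org.apache.giraph.io.formats.IdWithValueTextOutputFormat;
import org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat;
import org.apache.giraph.utils.InternalVertexRunner;

public class LocalSmokeTest {
  public static void main(String[] args) throws Exception {
    GiraphConfiguration conf = new GiraphConfiguration();
    conf.setComputationClass(SimpleShortestPathsComputation.class);
    conf.setVertexInputFormatClass(JsonLongDoubleFloatDoubleVertexInputFormat.class);
    conf.setVertexOutputFormatClass(IdWithValueTextOutputFormat.class);

    // A tiny in-memory graph; each line is [id, value, [[targetId, edgeValue], ...]].
    String[] graph = {
        "[0,0,[[1,1],[2,2]]]",
        "[1,0,[[0,1],[2,1]]]",
        "[2,0,[[0,2],[1,1]]]"
    };

    // Runs the whole job in a single JVM and returns the output lines directly.
    for (String line : InternalVertexRunner.run(conf, graph)) {
      System.out.println(line);
    }
  }
}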

I also searched for this and found GRAFT, which seemed to be a useful
debugging tool, but more suitable for testing on real input than for quickly
checking whether the code runs at all on a small input graph.

After searching through this mailing list's archive, I found a few
references to running a Giraph job with one click in Eclipse as well (see
also [1]), but most descriptions were very vague and I could not reproduce
them.


Lastly, one small question: in my first algorithm I had a small bug where I
would call getVertexValue(), then change the Java object, but not call
setVertexValue(), which resulted in my changes not being saved and led to
undesired behavior. After reading through another Giraph algorithm, I
noticed that it does the same (maybe it was with an edge value, I'm not 100%
sure about that) and doesn't call the setter, yet apparently the code works.
Can anybody shed some light on that? (I understand why it's useful to have
an explicit setVertexValue() method for writing/reading vertices to/from
disk; I just don't understand why calling it is not necessary there.)
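
To make the question concrete (using the getValue()/setValue() names from
the Vertex interface), here are the two variants side by side; my guess is
that variant A works because getValue() hands back a reference to the live
value object rather than a copy, but I would like confirmation:

import java.io.IOException;

import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ValueMutationExample extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    // Variant A: mutate the returned object in place, without setValue().
    DoubleWritable value = vertex.getValue();
    value.set(value.get() + 1.0);

    // Variant B: replace the object entirely; here setValue() is clearly
    // required, since Giraph otherwise never sees the new object.
    vertex.setValue(new DoubleWritable(42.0));

    vertex.voteToHalt();
  }
}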


Thanks,

Jan


[1] http://ben-tech.blogspot.in/2011/08/how-to-debug-hadoop-mapreduce-jobs-in.html
