Hmm, you might have an off-by-one error in your MasterCompute logging. The superstep counter is -1 during input loading and starts at 0 for the first iteration of computation. Assuming things haven't changed since 1.1.0-RC0, MasterCompute executes after the end of a superstep (after the global barrier) but before the start of the next superstep. The tricky bit is that it also runs after the input superstep (superstep -1). So what you might be seeing is the number of vertices after SS -1 (incorrect), after SS 0 (still incorrect), and after SS 1 (now correct).
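
To make that timeline concrete, here is a minimal sketch in plain Java. It is not Giraph code and calls no Giraph API: the class name MasterComputeTimeline and the helper finishedSuperstep are hypothetical, and the mapping it encodes (the n-th master compute call runs after superstep n-1) is only the behaviour described above, which I am assuming still holds. The three counts are the ones from the log quoted further down.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of when master compute runs relative to supersteps.
// Hypothetical names throughout; nothing here is part of Giraph.
final class MasterComputeTimeline {

    // Assumption from the explanation above: the first master compute call
    // (invocation 0) happens after the input superstep, which is numbered -1.
    static long finishedSuperstep(int invocation) {
        return invocation - 1L;
    }

    // Relabels a sequence of logged vertex counts, one per master compute call,
    // with the superstep that had just finished when each count was logged.
    static List<String> describe(long[] vertexCounts) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < vertexCounts.length; i++) {
            long done = finishedSuperstep(i);
            lines.add("master call " + i + ": after SS " + done
                    + ", before SS " + (done + 1)
                    + ", saw " + vertexCounts[i] + " vertices");
        }
        return lines;
    }

    public static void main(String[] args) {
        // First three counts from the log in this thread.
        long[] counts = {40103281L, 40103281L, 40383589L};
        for (String line : describe(counts)) {
            System.out.println(line);
        }
    }
}
```

Read this way, the first two (smaller) counts belong to SS -1 and SS 0, and the full count first shows up after SS 1, which is the pattern I would expect if the missing vertices are only created once they receive their first message.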
What Steven said regarding vertex addition is correct. Internally, when there is a message for a vertex that doesn't exist, Giraph will (by default) add that vertex via a vertex mutation. These mutations are all performed during a global barrier (i.e., between SS 0 and SS 1). So for SimplePageRankComputation, you have all vertices broadcasting to their out-edge neighbours during SS 0. This means all missing vertices receive messages and so they get added after SS 0 but before SS 1. In SS 1, you will observe that all vertices without out-edge neighbours are now added. The VertexValueFactory solution works because it is called by Giraph when creating/adding these missing vertices.

Peering into the internals, I believe the order of execution is: end of superstep reached -> workers flush all messages -> workers perform graph mutations -> all workers arrive at the global barrier -> master compute executes -> workers begin new superstep. (And input loading is a special case: input loading/partition exchange -> all workers arrive at the global barrier -> master compute executes -> workers begin superstep 0.)

Young

On Mon, May 4, 2015 at 1:16 PM, Steven Harenberg <[email protected]> wrote:

> My understanding is that a vertex with only incoming edges will not be
> active until it receives a message, which is why you don't see all of the
> vertices initially. The easiest way to test this is to write a script that
> parses your input and creates a new data file where every vertex is
> specified on a line of its own. Even if it has no outgoing neighbors, just
> leave the neighbor empty. Or, first just check if you have
> 40383589-40103281=280308 vertices with only incoming edges.
>
> Young provided another solution for fixing the initialization problem, and
> it looks like in the code that wasn't specified this code to still have the
> problem.
>
> Either transform the input (seems like the easiest thing to do), or try
> the fix Young said.
> I would bet either of those would fix the issue. Young
> may have better ideas since he seems more experienced with Giraph than I am.
>
> --Steve
>
> On Sat, May 2, 2015 at 2:19 PM, Kenrick Fernandes <[email protected]>
> wrote:
>
>> Thank you both for your responses.
>>
>> Steve, I faced the same problem when I created the Long input format
>> files. I tried running the code linked by Young above, using the
>> *SimplePageRankInputFormat.java* as well as the
>> *SimplePageRankVertex.java* in the repo.
>>
>> For the Twitter dataset, I added some *MasterCompute* code to log the
>> number of vertices that existed at each superstep. The results, however,
>> look pretty similar to the previous iteration:
>>
>> Current step is 1 - 40103281 existed in the previous superstep 0
>> Current step is 2 - 40103281 existed in the previous superstep 1
>> Current step is 3 - 40383589 existed in the previous superstep 2
>> Current step is 31 - 40383589 existed in the previous superstep 30
>>
>> It seems that a subset of vertices still only become active after the
>> first superstep, despite all vertices being initialized in superstep 0.
>> I can't think of a reason why - thoughts?
>>
>> Thanks,
>> Kenrick
>>
>> On Wed, Apr 29, 2015 at 2:33 PM, Young Han <[email protected]>
>> wrote:
>>
>>> For the initialization issue, you can define a (nested) class that
>>> extends DefaultVertexValueFactory (from org.apache.giraph.factories) and
>>> add
>>> "-Dgiraph.vertexValueFactoryClass=org.apache.giraph.examples.AlgClass\$AlgVertexValueFactory"
>>> after "org.apache.giraph.GiraphRunner" in your hadoop jar command.
>>>
>>> Also, the reason those input formats don't work is because PageRank is
>>> using LongWritable for vertex id and DoubleWritable for vertex value. As
>>> Roman pointed out, you have to have an input class that matches it (even if
>>> the input dataset has no "double" vertex values).
>>> An example (for Giraph 1.0.0) can be found here:
>>> https://github.com/xvz/graph-processing/blob/master/giraph-1.0.0/giraph-examples/src/main/java/org/apache/giraph/examples/SimplePageRankInputFormat.java
>>> and an example command that uses it here:
>>> https://github.com/xvz/graph-processing/blob/master/benchmark/giraph/pagerank.sh#L50
>>>
>>> Young
>>>
>>> On Wed, Apr 29, 2015 at 11:24 AM, Steven Harenberg <[email protected]>
>>> wrote:
>>>
>>>> Hey Kenrick,
>>>>
>>>> First, your commands above are wrong since you are specifying adjacency
>>>> list format with the -vif argument and since I believe
>>>> *LongLongNullTextInputFormat* refers to adjacency list format. However,
>>>> even with the right commands there will be issues and more things you
>>>> need to do.
>>>>
>>>> I did get the edgelist input format to work by creating a
>>>> LongNullTextEdgeInputFormat.java file just like the
>>>> giraph-core/src/main/java/org/apache/giraph/io/formats/IntNullTextEdgeInputFormat.java
>>>> file, but with longs instead of ints (this also required creating a
>>>> LongPair class).
>>>>
>>>> However, I would advise against using an edgelist input format in
>>>> Giraph as there are major underlying issues that I never figured out how to
>>>> resolve. Namely, for an edgelist format, Giraph only considers a vertex
>>>> active in the first superstep if it has an outgoing edge. This means that
>>>> vertices with only incoming edges won't be initialized with correct values
>>>> during things like PageRank, SSSP, or WCC and hence will output incorrect
>>>> results. (You can see my previous thread here:
>>>> http://mail-archives.apache.org/mod_mbox/giraph-user/201502.mbox/%3CCAHv2Baw7zFJ-s7dtNMv5dkNxz_zE436krE%2B6G4r3tp-HVgjW2g%40mail.gmail.com%3E
>>>> )
>>>>
>>>> The above issue can be avoided with adjacency list format by specifying
>>>> the vertex with no neighbors.
>>>> For example, if vertex v has only incoming
>>>> edges, then you make sure there is a line with just v and no neighbors
>>>> listed (
>>>> http://mail-archives.apache.org/mod_mbox/giraph-user/201408.mbox/%[email protected]%3E
>>>> ).
>>>>
>>>> If you figure out how to resolve the edgelist input issue please let me
>>>> know.
>>>>
>>>> Regards,
>>>> Steve
>>>>
>>>> On Sat, Apr 25, 2015 at 9:54 PM, Kenrick Fernandes <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Roman,
>>>>>
>>>>> Thanks for the quick response. There is no vertex data in this
>>>>> dataset though, and the vertex IDs posted above would fit in a
>>>>> Long. Would you advise changing the PageRankComputation
>>>>> formats, or working on a new input format?
>>>>>
>>>>> Thanks,
>>>>> Kenrick
>>>>>
>>>>> On Sat, Apr 25, 2015 at 7:40 PM, Roman Shaposhnik <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> One of the slightly annoying things in Giraph is that you have
>>>>>> to manually match your input format to your computation. In
>>>>>> your case, PageRankComputation requires LongWritable for
>>>>>> vertex ID and DoubleWritable for vertex Data. You may need
>>>>>> to hack one of the existing formats slightly.
>>>>>>
>>>>>> Thanks,
>>>>>> Roman.
>>>>>>
>>>>>> On Sat, Apr 25, 2015 at 2:58 PM, Kenrick Fernandes
>>>>>> <[email protected]> wrote:
>>>>>> > Hello,
>>>>>> >
>>>>>> > I'm trying to get Giraph to read the Twitter dataset as input for the
>>>>>> > SimplePageRankComputation program.
>>>>>> > The dataset format looks like this:
>>>>>> > 61578010 61147436
>>>>>> > 61578037 61147436
>>>>>> > 61578040 61147436
>>>>>> > (vertex IDs, with pairs representing edges)
>>>>>> >
>>>>>> > When I run the command with
>>>>>> > -vif org.apache.giraph.io.formats.IntIntNullTextInputFormat, I get this
>>>>>> > error:
>>>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex index types not
>>>>>> > assignable, computation - class org.apache.hadoop.io.LongWritable,
>>>>>> > VertexInputFormat - class org.apache.hadoop.io.IntWritable
>>>>>> >
>>>>>> > So I tried running the command with
>>>>>> > -vif org.apache.giraph.io.formats.LongLongNullTextInputFormat and I get a
>>>>>> > different one:
>>>>>> > java.lang.IllegalArgumentException: checkClassTypes: vertex value types not
>>>>>> > assignable, computation - class org.apache.hadoop.io.DoubleWritable,
>>>>>> > VertexInputFormat - class org.apache.hadoop.io.LongWritable
>>>>>> >
>>>>>> > I don't understand why the types in the input show up as different formats in
>>>>>> > each error. Also, as far as I could tell, there is no input format for
>>>>>> > DoubleDouble. Is there a different way to get the graph into Giraph without
>>>>>> > having to write custom input code? Thoughts would be much appreciated.
>>>>>> >
>>>>>> > -----
>>>>>> > Reference Command:
>>>>>> > hadoop jar giraph-examples-1.1.0-for-hadoop-1.1.2-jar-with-dependencies.jar
>>>>>> > org.apache.giraph.GiraphRunner
>>>>>> > org.apache.giraph.examples.PageRankComputation -vif
>>>>>> > org.apache.giraph.io.formats.LongLongNullTextInputFormat -vip
>>>>>> > /user/kenrick/twitter/input -op /user/kenrick/twitter/output -w 30
>>>>>> > -----
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Kenrick
