Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Accumulation should be possible with aggregators, and the idea of applying preference vectors seems very interesting too. Maybe these should be implemented as additional features in GIRAPH-191?

On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
Could you just accumulate the PR of dangling nodes in a variable and divide it equally in the next iteration? If you use a preference vector instead of dividing it equally, then you also have the strongly preferential (and weakly preferential, in case you divide equally) versions of RWR.
Cheers,
--
Gianmarco

On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna castagna.li...@googlemail.com wrote:
Sebastian Schelter wrote:
You are right, the code would throw an exception here, but I guess it would be sufficient to simply add a check here and not send any messages.

That would avoid throwing an exception, but I am not sure it is actually the best thing to do. Some suggest dividing the PageRank scores of dangling nodes evenly among all other pages. If you like, this is equivalent to another random jump, but forced by the fact that there are no outgoing links to select from. Using your notation below:

0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks / getNumVertices() )

See also p. 38 of the book Google's PageRank and Beyond: http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38

In other words, in pseudo code a serial implementation which accounts for dangling nodes would be:

dangling_nodes_ranks = 0
foreach node in dangling_nodes
    dangling_nodes_ranks += ranks.get ( node )
dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N

foreach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
    r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N

What do you think?
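The serial pseudo code above can be turned into runnable plain Java. This is a minimal sketch, not Giraph API: the class name, graph and iteration count are invented for the example. It accumulates the rank of dangling vertices each iteration and folds it into every vertex's update, as the pseudo code describes:

```java
import java.util.Arrays;

// Hedged sketch of the serial pseudo code above; names and the toy graph
// are illustrative, not part of Giraph.
public class DanglingPageRank {
    public static double[] pageRank(int[][] outLinks, double damping, int iterations) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        Arrays.fill(ranks, 1.0 / n);            // uniform start vector
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    danglingMass += ranks[v];   // no out-links: redistribute later
                } else {
                    double share = ranks[v] / outLinks[v].length;
                    for (int w : outLinks[v]) next[w] += share;
                }
            }
            for (int v = 0; v < n; v++) {
                // damping * (incoming + dangling share) + teleportation
                next[v] = damping * (next[v] + danglingMass / n) + (1 - damping) / n;
            }
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        // vertex 2 is dangling (no out-edges)
        int[][] graph = { {1, 2}, {2}, {} };
        double[] r = pageRank(graph, 0.85, 50);
        double sum = 0;
        for (double x : r) sum += x;
        System.out.printf("sum=%.6f%n", r[0] + r[1] + r[2]);
    }
}
```

Because the dangling mass is redistributed instead of dropped, the ranks remain a probability distribution (they sum to 1), which is the point of the correction.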
Paolo

--sebastian

On 17.05.2012 23:44, Paolo Castagna wrote:
Hi Sebastian, from SimplePageRankVertex.java, we have:

if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    sendMsgToAllEdges(
        new DoubleWritable(getVertexValue().get() / edges));
} else {
    voteToHalt();
}

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

Paolo

Sebastian Schelter wrote:
The implementation implicitly deals with dangling nodes by computing the pageRank as

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

Conceptually, this is the same as connecting every vertex with every other vertex it is not yet connected to. This removes all dangling nodes from the graph, as every vertex can reach every other vertex. Mining of Massive Datasets [1] has a very well written explanation of this approach.

--sebastian

[1] http://infolab.stanford.edu/~ullman/mmds.html

On 17.05.2012 16:23, Paolo Castagna wrote:
Hi, I noticed that the SimplePageRankVertex implementation does not deal with dangling nodes (dangling nodes are those nodes with no outgoing links). And, probably, that is one of the good reasons why there is a Simple in front. ;-)

Dangling nodes exist in many forms. For example, a page of data, a page with a postscript graph, [...], a page that has been fetched by a crawler but not yet explored - these are all examples of possible dangling nodes. The more ambitious the crawl, the bigger the proportion of dangling nodes because the set of fetched but uncrawled pages grows quickly. (p. 80) [1]

Wikipedia's PageRank page says: If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL at random and continues surfing again. When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages.
In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability usually set to d = 0.85, estimated from the frequency that an average surfer uses his or her browser's bookmark feature. [2]

Now, if I want to try to account for the dangling nodes, I need to sum all PageRank values of the dangling nodes and divide the total evenly among all other pages (which in practice means sending a message to all vertices in Giraph... which is not possible, as it would be crazy with large graphs). I imagine one can use a SumAggregator and alternate computing the contribution from dangling nodes and the PageRank values (i.e. PageRank values are computed only every 2 supersteps). What do you think?

Paolo

[1] Amy N. Langville and Carl D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings (2006)
[2] http://en.wikipedia.org/wiki/PageRank
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Paolo,

Very good points, could you add them to GIRAPH-191? I think that the convergence check does not need an extra superstep. It is sufficient to compute the L1-norm of the current pageRank vector, which can be done by an aggregator, and compare it to the previous aggregated value. I guess that summing up and redistributing the pagerank of dangling vertices can also be done without an extra superstep in an aggregator.

--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
Hi Gianmarco,

Gianmarco De Francisci Morales wrote:
Could you just accumulate the PR of dangling nodes in a variable and divide it equally in the next iteration?

Yep, this is exactly what I was thinking when I wrote: I imagine one can use a SumAggregator and alternate computing the contribution from dangling nodes and the PageRank values (i.e. PageRank values are computed only every 2 supersteps).

You need to use an Aggregator and spend a superstep just to aggregate/sum the PageRank of all dangling nodes. Then, in the successive superstep, you can use that, divided by the number of nodes in the graph, to distribute it equally among all other nodes (i.e. the sumOfDanglingNodesRanks / getNumVertices() below).

Another change/enhancement could be to check for convergence (to do that, you need to keep current and previous PageRank values) and spend another superstep just doing that. But, as you can see, things stop being simple, and the point of SimplePageRankVertex is to offer a very simple example. I did not start this thread with the aim of changing SimplePageRankVertex; I think it should stay as it is (and as Google's Pregel paper describes it). However, in practice, if you want to run PageRank or similar, you need to deal with problems such as: what to do with dangling nodes/pages? How to know how many iterations you need to reach the precision you want/need? Etc.

My initial message was to exchange some ideas and get some feedback and/or suggestions on how a not too simple PageRankVertex might look.
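The convergence check discussed here can be illustrated serially. This is a hedged sketch, not Giraph code: in Giraph the per-vertex terms would be summed by an aggregator, while here a plain loop stands in for it, and it compares successive rank vectors via the L1 norm of their difference (one common way to realize the check Sebastian describes); names and the tolerance are illustrative:

```java
// Hedged serial stand-in for an aggregator-based convergence check.
public class ConvergenceCheck {
    // L1 distance between the previous and current PageRank vectors;
    // in Giraph, each vertex would contribute |delta| to a sum aggregator.
    public static double l1Delta(double[] previous, double[] current) {
        double delta = 0.0;
        for (int i = 0; i < previous.length; i++) {
            delta += Math.abs(current[i] - previous[i]);
        }
        return delta;
    }

    public static boolean converged(double[] previous, double[] current, double tolerance) {
        return l1Delta(previous, current) < tolerance;
    }

    public static void main(String[] args) {
        double[] prev = {0.30, 0.30, 0.40};
        double[] curr = {0.31, 0.29, 0.40};
        // delta is 0.02, above the tolerance, so iteration continues
        System.out.println(converged(prev, curr, 1e-3));
    }
}
```

Each vertex only needs its own old and new value to contribute its term, so no extra superstep is required, matching Sebastian's point.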
;-)

If you use a preference vector instead of dividing it equally, then you also have the strongly preferential (and weakly preferential, in case you divide equally) versions of RWR.

Yep, but how do you get your preference vector? ;-)

As a first step, it would be great if we could come up with a PageRankVertex which takes into account dangling nodes and perhaps/optionally checks for convergence every few supersteps. Anyway, thanks for your comments and feedback, Gianmarco.

Paolo

Cheers,
--
Gianmarco

On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna castagna.li...@googlemail.com wrote:
Sebastian Schelter wrote:
You are right, the code would throw an exception here, but I guess it would be sufficient to simply add a check here and not send any messages.

That would avoid throwing an exception, but I am not sure it is actually the best thing to do. Some suggest dividing the PageRank scores of dangling nodes evenly among all other pages. If you like, this is equivalent to another random jump, but forced by the fact that there are no outgoing links to select from. Using your notation below:

0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks + sumOfDanglingNodesRanks / getNumVertices() )

See also p. 38 of the book Google's PageRank and Beyond: http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38

In other words, in pseudo code a serial implementation which accounts for dangling nodes would be:

dangling_nodes_ranks = 0
foreach node in dangling_nodes
    dangling_nodes_ranks += ranks.get ( node )
dangling_nodes_ranks = ( damping_factor * dangling_nodes_ranks ) / N

foreach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )
    r = damping_factor * r + dangling_nodes_ranks + ( 1 - damping_factor ) / N

What do you think?
Paolo

--sebastian

On 17.05.2012 23:44, Paolo Castagna wrote:
Hi Sebastian, from SimplePageRankVertex.java, we have:

if (getSuperstep() < MAX_SUPERSTEPS) {
    long edges = getNumOutEdges();
    sendMsgToAllEdges(
        new DoubleWritable(getVertexValue().get() / edges));
} else {
    voteToHalt();
}

What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

Paolo

Sebastian Schelter wrote:
The implementation implicitly deals with dangling nodes by computing the pageRank as

0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

Conceptually, this is the same as connecting every vertex with every other vertex
Re: RWR on Giraph
Hi, you are completely right. Coincidentally, I also started implementing RWR today; could you file a JIRA ticket for RWR? I would attach my work done so far and we could work a little on the code. I think that a lot of things need improvement in the PageRank/RWR implementation, e.g. stuff like the teleportation probability should be configurable. Furthermore, you shouldn't have to specify the number of supersteps that need to be executed; instead, convergence should be checked somehow via an aggregator.

Best,
Sebastian

On 17.05.2012 22:58, Gianmarco De Francisci Morales wrote:
Hi Giraphers,

I am implementing a Random Walk with Restart on Giraph. As far as I have understood, the only thing needed would be to modify PageRank in order to take into account the preference vector. This means all random jumps get back to the source of the RWR.

In practice, in org/apache/giraph/examples/SimplePageRankVertex.java the new vertex value is computed as:

DoubleWritable vertexValue =
    new DoubleWritable((0.15f / getNumVertices()) + 0.85f * sum);

And the only thing I should do to implement the RWR is:

if ( myID == sourceID )
    DoubleWritable vertexValue = new DoubleWritable(0.15f + 0.85f * sum);
else
    DoubleWritable vertexValue = new DoubleWritable(0.85f * sum);

Because all the random jumps converge on the single source. Am I correct or am I missing something?

Cheers,
--
Gianmarco
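Gianmarco's update rule can be checked with a small serial sketch. This is a hedged, plain-Java stand-in, not the Giraph implementation: the toy graph, source id and iteration count are invented, and dangling mass is sent back to the source (the strongly preferential choice, one possible policy) so that the ranks remain a distribution:

```java
// Hedged sketch of Random Walk with Restart: all teleport mass returns
// to a single source vertex instead of being spread uniformly.
public class RandomWalkWithRestart {
    public static double[] rwr(int[][] outLinks, int source, double damping, int iterations) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        ranks[source] = 1.0;                        // the walk starts at the source
        for (int iter = 0; iter < iterations; iter++) {
            double[] next = new double[n];
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    next[source] += damping * ranks[v]; // dangling mass restarts at source
                } else {
                    double share = damping * ranks[v] / outLinks[v].length;
                    for (int w : outLinks[v]) next[w] += share;
                }
            }
            next[source] += 1 - damping;            // restart: every random jump goes home
            ranks = next;
        }
        return ranks;
    }

    public static void main(String[] args) {
        int[][] graph = { {1}, {2}, {0} };          // a directed 3-cycle
        double[] r = rwr(graph, 0, 0.85, 100);
        // ranks decrease with distance from the source along the cycle
        System.out.printf("%.4f %.4f %.4f%n", r[0], r[1], r[2]);
    }
}
```

This matches the email's branch on myID == sourceID: only the source receives the 0.15 teleportation term, every vertex receives the damped neighbor sum.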
[Announcement] Giraph talk in Berlin on May 29th
Hi, I will give a talk titled Large Scale Graph Processing with Apache Giraph in Berlin on May 29th. Details are available at: https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 Best, Sebastian
Video of my talk
Introducing Apache Giraph for Large Scale Graph Processing http://vimeo.com/40737998
Re: Slides for my talk at the Berlin Hadoop Get Together
I'm not sure what we'll do at the workshop, maybe Claudio could give a presentation of his excellent prezi slides?

Best,
Sebastian

On 19.04.2012 22:07, Avery Ching wrote:
Very nice! Will these be similar to the 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords? It would be good to add at least one of them to the page.

Avery

On 4/19/12 12:31 PM, Sebastian Schelter wrote:
Here are the slides of my talk Introducing Apache Giraph for Large Scale Graph Processing at the Berlin Hadoop Get Together yesterday: http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing

I reused a lot of stuff from Claudio's excellent prezi presentation.

Best,
Sebastian
Re: Slides for my talk at the Berlin Hadoop Get Together
The benchmark is part of a paper which my colleagues have recently submitted. Once it's published, I'll ask them to describe their benchmark results on this list.

On 19.04.2012 22:20, Claudio Martella wrote:
I like it, very clear. I think it would be nice to add the one with some comparison with Stratosphere, it would be a nice add-on. Do you have any update on the benchmark between Giraph and Stratosphere you mentioned on Twitter a while ago? Looking forward to meeting you guys in Berlin.

On Thu, Apr 19, 2012 at 10:07 PM, Avery Ching ach...@apache.org wrote:
Very nice! Will these be similar to the 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords? It would be good to add at least one of them to the page.

Avery

On 4/19/12 12:31 PM, Sebastian Schelter wrote:
Here are the slides of my talk Introducing Apache Giraph for Large Scale Graph Processing at the Berlin Hadoop Get Together yesterday: http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing

I reused a lot of stuff from Claudio's excellent prezi presentation.

Best,
Sebastian
Slides for my talk at the Berlin Hadoop Get Together
Here are the slides of my talk Introducing Apache Giraph for Large Scale Graph Processing at the Berlin Hadoop Get Together yesterday: http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing I reused a lot of stuff from Claudio's excellent prezi presentation. Best, Sebastian
Announcement: 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords
Hi everybody,

I'd like to announce the 'Parallel Processing beyond MapReduce' workshop, which will take place directly after the Berlin Buzzwords conference ( http://berlinbuzzwords.de/ ). This workshop will discuss novel paradigms for parallel processing beyond the traditional MapReduce paradigm offered by Apache Hadoop.

The workshop will introduce two new systems:

Apache Giraph aims at processing large graphs, runs on standard Hadoop infrastructure and is a loose port of Google's Pregel system. Giraph follows the bulk-synchronous parallel model relative to graphs, where vertices can send messages to other vertices during a given superstep.

Stratosphere (http://www.stratosphere.eu) is a system that is developed in a joint research project by Technische Universität Berlin, Humboldt Universität zu Berlin and the Hasso-Plattner-Institut in Potsdam. It is a database-inspired, large-scale data processor based on concepts of robust and adaptive execution. Stratosphere offers the PACT programming model, which extends the MapReduce programming model with additional second-order functions. As its execution platform it uses the Nephele system, a massively parallel data flow engine which is also researched and developed in the project.

Attendees will hear about the new possibilities of Hadoop's NextGen MapReduce architecture (YARN) and get a detailed introduction to the Apache Giraph and Stratosphere systems. After that there will be plenty of time for questions, discussions and diving into source code.

As a prerequisite, attendees have to bring a notebook with:
- a copy of Giraph downloaded with source
- Hadoop 0.23+ source tree and JARs local
- a copy of Stratosphere with source
- an IDE of their choice

The workshop will take place on the 6th and 7th of June and is limited to 15 attendees. Please register by sending an email to sebastian [DOT] schelter [AT] tu-berlin [DOT] de

http://berlinbuzzwords.de/content/workshops-berlin-buzzwords

/s
Announcement: Giraph talk at the next Hadoop Get Together in Berlin
This is to announce the next Apache Hadoop Get-Together in Berlin.

When: 18 April, 18:00
Where: Immobilien Scout GmbH, Andreasstr. 10, 10243 Berlin

As always, there will be slots of 30 min each for talks on your Hadoop topic. After each talk there will be time for discussion. We'll then head over to a bar after the event for some beer and something to eat. It is important to indicate attendance: only registered visitors will be permitted to attend. Please register here or find more information: https://www.xing.com/events/hadoop-get-together-18-april-980654

Talks scheduled thus far:

Speaker: Alexander Lorenz
Session: FlumeNG
After a long time running Flume in large installations, the Flume team decided to rewrite it completely. In this session Alex will show you the new stack, what we changed and why. FlumeNG is an Apache incubation project, licensed under the Apache License v2.
Bio: Alex is a Hadoop/NoSQL engineer working within Cloudera's Customer Operations Engineering team. Within the Hadoop ecosystem, Alex specializes in Sqoop and Flume. He has also worked, since the late 80's, on various open source projects including Linux. He blogs - and sometimes gives presentations - from Germany to evangelize Hadoop.

Speaker: Sebastian Schelter
Session: Introducing Apache Giraph for Large Scale Graph Processing
Web and online social graphs have been growing rapidly in size and scale during the past decade. Processing such graphs, which consist of millions of vertices and hundreds of millions to billions of edges, is a huge technical challenge. Experience shows that the Map/Reduce paradigm is not a good fit for working with large graph structures. This talk will give an introduction to Apache Giraph, a system that implements the bulk-synchronous parallel (BSP) model in a way that is extremely well suited for implementing graph algorithms.
Giraph is a very young project which is currently developed by people working at leading internet platforms such as Facebook, Twitter and LinkedIn, and which can be run on a standard Hadoop infrastructure.

Sebastian Schelter is a PhD student in the Database Systems and Information Management Group (DIMA) of TU Berlin. He is also a committer and PMC member of Apache Mahout and Apache Giraph.

/s
Re: IPC error
This looks like the code tries to invoke a no-arg constructor of ArrayWritable (which doesn't exist, AFAIK). Do you use a custom Writable? It needs to have a working no-arg constructor.

--sebastian

2012/2/21 Avery Ching ach...@apache.org:
I wonder if Giraph is compiled with the correct Hadoop version? What is the Hadoop version you're using?

Avery

On 2/20/12 3:41 PM, David Garcia wrote:
Hello all, I'm getting this error and I'm pretty sure that I have my classpath stuff set up right... maybe not.

ERROR org.apache.giraph.comm.BasicRPCCommunications:
org.apache.hadoop.ipc.RemoteException: IPC server unable to read call parameters:
java.lang.NoSuchMethodException: org.apache.hadoop.io.ArrayWritable.<init>()

Any ideas? This is on superstep 0.

-David
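To illustrate Sebastian's point: the IPC layer instantiates Writables reflectively through their no-arg constructor, which is exactly what produces a NoSuchMethodException for <init>() when a class only has parameterized constructors. Below is a hedged, stand-alone sketch of the pattern using plain java.io (no Hadoop on the classpath); the class and field names are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hedged sketch of the custom-Writable pattern; mimics Hadoop's
// write/readFields contract without depending on Hadoop itself.
public class IntPairWritable {
    private int first;
    private int second;

    public IntPairWritable() { }                // required for reflective instantiation

    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public int getFirst() { return first; }
    public int getSecond() { return second; }

    public static void main(String[] args) throws Exception {
        // serialize, then deserialize via the no-arg constructor,
        // the way a reflective deserializer would
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new IntPairWritable(3, 7).write(new DataOutputStream(bytes));

        IntPairWritable copy = IntPairWritable.class.getDeclaredConstructor().newInstance();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getFirst() + "," + copy.getSecond());
    }
}
```

Without the empty constructor, the getDeclaredConstructor() call in main would throw the same NoSuchMethodException David saw, which is why a custom Writable must always provide one.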
Re: IntIntNullIntVertex initialize method
That's a good point, can you open a Jira issue for that? Maybe we should use some annotation like @Nullable to mark parameters that can be null, to make it easier to avoid forgetting checks in the implementing code.

--sebastian

On 19.02.2012 19:47, yavuz gokirmak wrote:
Hello,

I am writing my own Giraph algorithms to understand the system, and I have another question about IntIntNullIntVertex. I think the IntIntNullIntVertex.initialize method should handle a null edges argument, because in the VertexResolver.resolve method (line 91) vertex.initialize is called with the edges argument as null.

Finally, I wrote my own simple graph algorithm and am able to run it on Giraph. Sorry for the newbie questions.

- ygokirmak