Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

2012-05-18 Thread Sebastian Schelter
Accumulation should be possible with aggregators, and the idea of
applying preference vectors seems very interesting too.

Maybe these should be implemented as additional features in GIRAPH-191?

On 18.05.2012 11:24, Gianmarco De Francisci Morales wrote:
 Could you just accumulate the PR of dangling nodes in a variable and divide
 it equally in the next iteration?
 If you use a preference vector instead of dividing it equally, then you
 have also the strongly preferential (and weakly preferential in case you
 divide equally) versions of RWR.
 
 Cheers,
 --
 Gianmarco
 
 
 
 
 On Fri, May 18, 2012 at 12:50 AM, Paolo Castagna 
 castagna.li...@googlemail.com wrote:
 
 Sebastian Schelter wrote:
 You are right, the code would throw an exception here, but I guess it
 would be sufficient to simply add a check here and not send any messages.

 That would avoid throwing an exception, but I am not sure it is actually
 the best thing to do.
 Some suggest to divide the PageRank scores of dangling nodes evenly among
 all other pages.
 If you want, this is equivalent to another random jump, but forced by the
 fact that there are no outgoing links to select from.
 Using your notation below:

  0.15f / getNumVertices() + 0.85f * ( sumOfNeighborRanks +
 sumOfDanglingNodesRanks / getNumVertices() )

 See also page 38 of the book "Google's PageRank and Beyond":
 http://books.google.co.uk/books?id=hxvB14-I0twC&pg=PA38

 In other words, in pseudo code, a serial implementation which accounts for
 dangling nodes would be:

 
 dangling_nodes_rank = 0
 foreach node in dangling_nodes
    dangling_nodes_rank += ranks.get ( node )
 dangling_nodes_rank = ( damping_factor * dangling_nodes_rank ) / N

 foreach node1 in graph
    r = 0
    foreach node2 in incoming_links ( node1 )
        r += ranks.get ( node2 ) / number_outgoing_links ( node2 )

    r = damping_factor * r + dangling_nodes_rank + ( 1 - damping_factor ) / N
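 For reference, the pseudocode above can be turned into a runnable standalone
 Java sketch (hypothetical code; class and method names are mine, not Giraph's):

```java
import java.util.Arrays;

public class DanglingPageRank {
    static final double DAMPING = 0.85;

    // One serial PageRank iteration over an adjacency list (node ids 0..n-1).
    // The rank held by dangling nodes is summed first and redistributed evenly.
    static double[] iterate(int[][] outLinks, double[] ranks) {
        int n = ranks.length;
        double danglingRank = 0.0;
        for (int v = 0; v < n; v++) {
            if (outLinks[v].length == 0) {
                danglingRank += ranks[v];
            }
        }
        // teleportation share plus the redistributed dangling mass
        double[] next = new double[n];
        Arrays.fill(next, (1.0 - DAMPING) / n + DAMPING * danglingRank / n);
        for (int v = 0; v < n; v++) {
            for (int target : outLinks[v]) {
                next[target] += DAMPING * ranks[v] / outLinks[v].length;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        int[][] graph = { {1}, {0, 2}, {} }; // node 2 is dangling
        double[] ranks = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };
        for (int i = 0; i < 50; i++) {
            ranks = iterate(graph, ranks);
        }
        // rank mass is conserved: the three values sum to 1.0
        System.out.printf("%.4f %.4f %.4f%n", ranks[0], ranks[1], ranks[2]);
    }
}
```

 Because the dangling rank is folded back in, the total rank mass stays at 1.0
 across iterations, which is exactly what dropping the messages would lose.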
 

 What do you think?

 Paolo


 --sebastian


 On 17.05.2012 23:44, Paolo Castagna wrote:
 Hi Sebastian,
 from SimplePageRankVertex.java, we have:

 if (getSuperstep() < MAX_SUPERSTEPS) {
   long edges = getNumOutEdges();
   sendMsgToAllEdges(
       new DoubleWritable(getVertexValue().get() / edges));
 } else {
   voteToHalt();
 }

 What happens if getNumOutEdges() returns 0 (i.e. it's a dangling node)?

 Paolo

 Sebastian Schelter wrote:
 The implementation implicitly deals with dangling nodes by computing
 the
 pageRank as

 0.15f / getNumVertices() + 0.85f * sumOfNeighborRanks

 Conceptually, this is the same as connecting every vertex with every
 other vertex it is not yet connected to. This removes all dangling
 nodes
 from the graph as every vertex can reach every other vertex.

 Mining of Massive Datasets [1] has a very well written explanation of
 this approach.

 --sebastian


 [1] http://infolab.stanford.edu/~ullman/mmds.html

 On 17.05.2012 16:23, Paolo Castagna wrote:
 Hi,
 I noticed that the SimplePageRankVertex implementation does not deal
 with
 dangling nodes (dangling nodes are those nodes with no outgoing
 links).
 And, probably that is one of the good reasons why there is a Simple
 in front. ;-)

  Dangling nodes exist in many forms. For example, a page of data, a
 page with a
 postscript graph, [...], a page that has been fetched by a crawler
 but not yet
 explored - these are all examples of possible dangling nodes. The
 more ambitious
 the crawl, the bigger the proportion of dangling nodes because the
 set of
 fetched but uncrawled pages grows quickly. (page 80) [1]

 Wikipedia's PageRank page says:

  If a page has no links to other pages, it becomes a sink and
 therefore
 terminates the random surfing process. If the random surfer arrives
 at a sink
 page, it picks another URL at random and continues surfing again. When
 calculating PageRank, pages with no outbound links are assumed to
 link out to
 all other pages in the collection. Their PageRank scores are
 therefore divided
 evenly among all other pages. In other words, to be fair with pages
 that are not
 sinks, these random transitions are added to all nodes in the Web,
 with a
 residual probability usually set to d = 0.85, estimated from the
 frequency that
 an average surfer uses his or her browser's bookmark feature. [2]

 Now, if I want to try to account for the dangling nodes I need to sum
 all
 PageRank values for the dangling nodes and divide it evenly among all
 other
 pages (which in practice means sending a message to all vertices in
 Giraph...
 which is not possible, as it would be crazy with large graphs).

 I imagine one can use a SumAggregator and alternate computing
 contribution from
 dangling nodes and PageRank values (i.e. PageRank values are computed
 only every
 2 supersteps).
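
 That alternating scheme can be simulated in plain Java (a hypothetical
 standalone sketch, not the Giraph API; the `aggregated` variable stands in
 for the SumAggregator value):

```java
public class AlternatingSupersteps {
    // Even supersteps only aggregate the dangling-node rank sum;
    // odd supersteps run the PageRank update using that sum.
    static double[] run(int[][] outLinks, int supersteps) {
        int n = outLinks.length;
        double[] ranks = new double[n];
        java.util.Arrays.fill(ranks, 1.0 / n);
        double aggregated = 0.0; // stands in for the SumAggregator

        for (int step = 0; step < supersteps; step++) {
            if (step % 2 == 0) {
                // aggregation-only superstep: sum the dangling rank
                aggregated = 0.0;
                for (int v = 0; v < n; v++) {
                    if (outLinks[v].length == 0) {
                        aggregated += ranks[v];
                    }
                }
            } else {
                // update superstep: use the previously aggregated sum
                double[] next = new double[n];
                java.util.Arrays.fill(next, 0.15 / n + 0.85 * aggregated / n);
                for (int v = 0; v < n; v++) {
                    for (int t : outLinks[v]) {
                        next[t] += 0.85 * ranks[v] / outLinks[v].length;
                    }
                }
                ranks = next;
            }
        }
        return ranks;
    }

    public static void main(String[] args) {
        double[] r = run(new int[][] { {1}, {0, 2}, {} }, 40);
        System.out.printf("%.4f %.4f %.4f%n", r[0], r[1], r[2]);
    }
}
```

 The price of this scheme is that each PageRank update costs two supersteps
 instead of one, which is exactly the trade-off discussed here.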

 What do you think?

 Paolo

  [1] Amy N. Langville and Carl D. Meyer, Google's PageRank and
 Beyond: The
 Science of Search Engine Rankings (2006)
  [2] http://en.wikipedia.org/wiki/PageRank


 



Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...

2012-05-18 Thread Sebastian Schelter
Hi Paolo,

Very good points, could you add them to GIRAPH-191?

I think that the convergence check does not need an extra superstep. It
is sufficient to compute the L1-norm of the current pageRank vector,
which can be done with an aggregator, and to compare it to the previously
aggregated value.
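
A sketch of such a check (hypothetical standalone Java, not Giraph code; in
Giraph the sum would come from an aggregator rather than a direct loop, and
this variant compares successive rank vectors via the L1 norm of their
difference):

```java
public class ConvergenceCheck {
    // L1 norm of the difference between successive PageRank vectors
    static double l1Diff(double[] prev, double[] curr) {
        double sum = 0.0;
        for (int i = 0; i < prev.length; i++) {
            sum += Math.abs(curr[i] - prev[i]);
        }
        return sum;
    }

    static boolean converged(double[] prev, double[] curr, double tolerance) {
        return l1Diff(prev, curr) < tolerance;
    }

    public static void main(String[] args) {
        double[] prev = { 0.30, 0.50, 0.20 };
        double[] curr = { 0.31, 0.49, 0.20 };
        // l1Diff is 0.02 here, so a tolerance of 0.05 reports convergence
        System.out.println(converged(prev, curr, 0.05)); // prints "true"
    }
}
```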

I guess that summing up and redistributing the pagerank of dangling
vertices can also be done without an extra superstep in an aggregator.

--sebastian

On 18.05.2012 12:31, Paolo Castagna wrote:
 Hi Gianmarco
 
 Gianmarco De Francisci Morales wrote:
 Could you just accumulate the PR of dangling nodes in a variable and
 divide it equally in the next iteration?
 
 Yep, this is exactly what I was thinking when I wrote: "I imagine one can use 
 a SumAggregator and alternate computing contribution from dangling nodes and 
 PageRank values (i.e. PageRank values are
 computed only every 2 supersteps)."
 
 You need to use an Aggregator and spend a superstep just to aggregate/sum the 
 PageRank of all dangling nodes.
 Then, in the successive superstep, you can divide that by the number of 
 nodes in the graph and distribute it equally among all other nodes (i.e. the 
 sumOfDanglingNodesRanks / getNumVertices() below).
 
 Another, change/enhancement could be to check for convergence (to do that, 
 you need to keep current and previous PageRank values) and spend another 
 superstep just doing that.
 But, as you can see, things stop being simple, and the point of 
 SimplePageRankVertex is to offer a very simple example.
 I did not start this thread with the aim of changing SimplePageRankVertex; I 
 think it should stay as it is (and as Google's Pregel paper describes it).
 
 However, in practice, if you want to run PageRank or similar, you need to 
 deal with problems such as: what to do with dangling nodes/pages? how to know 
 how many iterations you need to reach the
 precision you want/need? Etc.
 
 My initial message was to exchange some ideas and get some feedback and/or 
 suggestions on how a not too simple PageRankVertex might look. ;-)
 
 If you use a preference vector instead of dividing it equally, then you
 have also the strongly preferential (and weakly preferential in case you
 divide equally) versions of RWR.
 
 Yep, but how do you get your preference vector? ;-)
 
 As a first step, it would be great if we can come up with a PageRankVertex 
 which takes into account dangling nodes and perhaps/optionally check for 
 convergence every few supersteps.
 
 Anyway, thanks for your comments and feedback Gianmarco.
 
 Paolo
 

Re: RWR on Giraph

2012-05-17 Thread Sebastian Schelter
Hi,

you are completely right. I also started implementing RWR today,
coincidentally. Could you file a JIRA ticket for RWR? I would attach my
work done so far and we could work a little on the code.

I think that a lot of things need improvement in the PageRank/RWR
implementation, e.g. stuff like the teleportation probability should be
configurable. Furthermore you shouldn't have to specify the number of
supersteps that need to be executed, but convergence should be checked
somehow via an aggregator.


Best,
Sebastian


On 17.05.2012 22:58, Gianmarco De Francisci Morales wrote:
 Hi Giraphers,
 
 I am implementing a Random Walk with Restart on Giraph.
 As far as I have understood, the only thing needed would be to modify
 PageRank in order to take into account the preference vector.
 This means all random jumps get back to the source of the RWR.
 In practice, in org/apache/giraph/examples/SimplePageRankVertex.java the
 new vertex value is computed as:
 
   DoubleWritable vertexValue = new DoubleWritable((0.15f /
 getNumVertices()) + 0.85f * sum);
 
 And the only thing I should do to implement the RWR is
 
 DoubleWritable vertexValue;
 if ( myID == sourceID )
   vertexValue = new DoubleWritable(0.15f + 0.85f * sum);
 else
   vertexValue = new DoubleWritable(0.85f * sum);
 
 Because all the random jumps converge on the single source.
 Am I correct or am I missing something?
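 
 The update rule described here, as a standalone Java sketch (hypothetical
 helper, not Giraph API; myId, sourceId and sumOfNeighborRanks mirror the
 names used in this thread):

```java
public class RwrUpdate {
    static final double DAMPING = 0.85;

    // RWR vertex update: the whole restart mass (1 - DAMPING) goes to
    // the source vertex instead of being spread over all vertices.
    static double newValue(long myId, long sourceId, double sumOfNeighborRanks) {
        if (myId == sourceId) {
            return (1.0 - DAMPING) + DAMPING * sumOfNeighborRanks;
        }
        return DAMPING * sumOfNeighborRanks;
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", newValue(0, 0, 0.4)); // source vertex: 0.49
        System.out.printf("%.2f%n", newValue(1, 0, 0.4)); // other vertex: 0.34
    }
}
```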
 
 Cheers,
 --
 Gianmarco
 



[Announcement] Giraph talk in Berlin on May 29th

2012-05-12 Thread Sebastian Schelter
Hi,

I will give a talk titled "Large Scale Graph Processing with Apache
Giraph" in Berlin on May 29th. Details are available at:

https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275

Best,
Sebastian


Video of my talk

2012-04-23 Thread Sebastian Schelter
Introducing Apache Giraph for Large Scale Graph Processing

http://vimeo.com/40737998


Re: Slides for my talk at the Berlin Hadoop Get Together

2012-04-21 Thread Sebastian Schelter
I'm not sure what we'll do at the workshop, maybe Claudio could give a
presentation of his excellent prezi slides?

Best,
Sebastian


On 19.04.2012 22:07, Avery Ching wrote:
 Very nice!  Will these be similar to the 'Parallel Processing beyond
 MapReduce' workshop after Berlin Buzzwords?  It would be good to add at
 least one of them to the page.
 
 Avery
 
 On 4/19/12 12:31 PM, Sebastian Schelter wrote:
 Here are the slides of my talk "Introducing Apache Giraph for Large
 Scale Graph Processing" at the Berlin Hadoop Get Together yesterday:

 http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing


 I reused a lot of stuff from Claudio's excellent prezi presentation.

 Best,
 Sebastian
 



Re: Slides for my talk at the Berlin Hadoop Get Together

2012-04-21 Thread Sebastian Schelter
The benchmark is part of a paper which my colleagues have recently
submitted. Once it's published, I'll ask them to describe their
benchmark results on this list.




On 19.04.2012 22:20, Claudio Martella wrote:
 I like it, very clear.
 I think it would be nice to add the one with some comparison with
 Stratosphere, it would be a nice add-on. Do you have any update on the
 benchmark between Giraph and stratosphere you mentioned on twitter a
 while ago?
 
 Looking forward to meeting you guys in Berlin.
 



Slides for my talk at the Berlin Hadoop Get Together

2012-04-19 Thread Sebastian Schelter
Here are the slides of my talk "Introducing Apache Giraph for Large
Scale Graph Processing" at the Berlin Hadoop Get Together yesterday:

http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing

I reused a lot of stuff from Claudio's excellent prezi presentation.

Best,
Sebastian


Announcement: 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords

2012-04-04 Thread Sebastian Schelter
Hi everybody,

I'd like to announce the 'Parallel Processing beyond MapReduce' workshop
which will take place directly after the Berlin Buzzwords conference (
http://berlinbuzzwords.de/ ).


This workshop will discuss novel paradigms for parallel processing
beyond the traditional MapReduce paradigm offered by Apache Hadoop.

The workshop will introduce two new systems:

Apache Giraph aims at processing large graphs, runs on standard Hadoop
infrastructure and is a loose port of Google's Pregel system. Giraph
follows the bulk-synchronous parallel model applied to graphs, where
vertices can send messages to other vertices during a given superstep.

Stratosphere (http://www.stratosphere.eu) is a system that is developed
in a joint research project by Technische Universität Berlin, Humboldt
Universität zu Berlin and the Hasso-Plattner-Institut in Potsdam. It is
a database inspired, large-scale data processor based on concepts of
robust and adaptive execution. Stratosphere offers the PACT programming
model that extends the MapReduce programming model with additional
second order functions. As execution platform it uses the Nephele
system, a massively parallel data flow engine which is also researched
and developed in the project.

Attendees will hear about the new possibilities of Hadoop's NextGen
MapReduce architecture (YARN) and get a detailed introduction to the
Apache Giraph and Stratosphere systems. After that there will be plenty
of time for questions, discussions and diving into source code.

As a prerequisite, attendees have to bring a notebook with:
 - a copy of Giraph downloaded with source
 - Hadoop 0.23+ source tree and JARS local
 - a copy of Stratosphere with source
 - an IDE of their choice

The workshop will take place on the 6th and 7th of June and is limited
to 15 attendees. Please register by sending an email to sebastian [DOT]
schelter [AT] tu-berlin [DOT] de

http://berlinbuzzwords.de/content/workshops-berlin-buzzwords


/s


Announcement: Giraph talk at the next Hadoop Get Together in Berlin

2012-04-04 Thread Sebastian Schelter
This is to announce the next Apache Hadoop Get-Together in Berlin

When: April 18, 6 p.m.
Where: Immobilien Scout GmbH, Andreasstr. 10, 10243 Berlin

As always there will be slots of 30min each for talks on your Hadoop
topic. After each talk there will be time for discussion. We’ll then
head over to a bar after the event for some beer and something to eat.

It is important to indicate attendance. Only registered visitors will be
permitted to attend.

Please register here or find more information:
https://www.xing.com/events/hadoop-get-together-18-april-980654

Talks scheduled thus far:

Speaker: Alexander Lorenz

Session:
FlumeNG
After a long period of running in large installations, the Flume team
decided to rewrite Flume completely. In this session Alex will show you
the new stack, what was changed and why. FlumeNG is an Apache Incubator
project licensed under the Apache License v2.

Bio:
Alex is a Hadoop/NoSQL engineer working within Cloudera's Customer
Operations Engineering team. Within the Hadoop ecosystem, Alex
specializes in Sqoop and Flume. He has also worked, since the late 80's,
on various open source projects including Linux. He blogs - and
sometimes gives presentations - from Germany to evangelize Hadoop.

Speaker: Sebastian Schelter

Session:
Introducing Apache Giraph for Large Scale Graph Processing

Web and online social graphs have been rapidly growing in size and scale
during the past decade. Processing such graphs which consist of millions
of vertices and hundreds of millions to billions of edges is a huge
technical challenge. Experience shows that the Map/Reduce paradigm is not a
good fit for working with large graph structures. This talk will give an
introduction to Apache Giraph, a system that implements the
bulk-synchronous parallel (BSP) model in a way that is extremely well
suited for implementing graph algorithms. Giraph is a very young project
which is currently developed by people working at leading internet
platforms such as Facebook, Twitter, LinkedIn and can be run on a
standard Hadoop infrastructure.

Sebastian Schelter is a PhD student from the Database Systems and
Information Management Group (DIMA) of TU Berlin. He is also committer
and PMC member of Apache Mahout and Apache Giraph.

/s


Re: IPC error

2012-02-20 Thread Sebastian Schelter
This looks like the code tries to invoke a no-arg constructor of
ArrayWritable (which doesn't exist AFAIK). Do you use a custom
writable? It needs to have a working no-arg constructor.
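
A minimal sketch of that contract, using only java.io (the real interface is
org.apache.hadoop.io.Writable; the class name here is hypothetical): Hadoop's
IPC layer instantiates the Writable reflectively via its no-arg constructor
and then calls readFields().

```java
import java.io.*;

public class PairWritable {
    private int left;
    private int right;

    // Required: Hadoop's IPC/serialization layer creates instances
    // reflectively through this no-arg constructor.
    public PairWritable() {}

    public PairWritable(int left, int right) {
        this.left = left;
        this.right = right;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(left);
        out.writeInt(right);
    }

    public void readFields(DataInput in) throws IOException {
        left = in.readInt();
        right = in.readInt();
    }

    public int getLeft() { return left; }
    public int getRight() { return right; }

    static byte[] toBytes(PairWritable p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static PairWritable fromBytes(byte[] data) {
        try {
            // What the IPC server relies on: no-arg constructor + readFields
            PairWritable p = new PairWritable();
            p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
            return p;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        PairWritable copy = fromBytes(toBytes(new PairWritable(7, 42)));
        System.out.println(copy.getLeft() + " " + copy.getRight()); // prints "7 42"
    }
}
```

Without the no-arg constructor, deserialization fails with exactly the kind of
NoSuchMethodException reported below.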

--sebastian

2012/2/21 Avery Ching ach...@apache.org:
 I wonder if Giraph is compiled with the correct Hadoop version?  What is the
 Hadoop version you're using?

 Avery


 On 2/20/12 3:41 PM, David Garcia wrote:

 Hello all,  I'm getting this error and I'm pretty sure that I have my
 classpath stuff set up right... maybe not.

 ERROR org.apache.giraph.comm.BasicRPCCommunications:
 org.apache.hadoop.ipc.RemoteException: IPC server unable to read call
 parameters: java.lang.NoSuchMethodException:
 org.apache.hadoop.io.ArrayWritable.<init>()
 Any ideas?  This is on super step 0.
 -David




Re: IntIntNullIntVertex initialize method

2012-02-19 Thread Sebastian Schelter
That's a good point, can you open a Jira issue for that? Maybe we should
use some annotation like @Nullable to mark parameters that can be null
to make it easier to avoid forgetting checks in the implementing code.

--sebastian



On 19.02.2012 19:47, yavuz gokirmak wrote:
 Hello,
 
 I am writing my own giraph algorithms to understand the system,
 I have again a question about IntIntNullIntVertex
 
 I think the IntIntNullIntVertex.initialize method should handle a null edges
 argument, because VertexResolver.resolve (line 91) calls vertex.initialize
 with a null edges argument.
 
 Finally, I wrote my own simple graph algorithm and am able to run it over Giraph.
 Sorry for newbie questions.
 
 - ygokirmak