It totally depends on the input distribution. One very simple thing you can do:

> Define a VertexResolver that, on every vertex creation, sets its id = the domain of the URL and its value = the set of URLs in that domain; it keeps appending as more vertices with the same id (i.e., the same domain) are read from the input.

> [Now you can ignore edges altogether. All you are left with is these huge vertices, identified by domain, whose value is the set of URLs.] Here you can use the aggregator approach: send (domain, size of set) to the master; the master combines these into something like [(domain1, offset1), (domain2, offset2), ...]. All the (huge) vertices read this aggregator and figure out their offset; then, when you output, each URL in a set gets number = offset + its position in the set.

So you have a map now. It is highly unstable, though: adding one more URL to a domain later changes the whole order. That is when you can use id = domain + insert date, etc. [[This will stop working at some point anyway, because the aggregator has to carry huge messages, the offset computation then has to be spread over multiple supersteps, and so on.]]
Anyway, now that you have the url -> number map, all you have to do is a join, and that's simple: read your original table plus this map table in a single Giraph job, and you can use 2 supersteps to rename all the vertices properly.
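Those 2 supersteps could look something like this, assuming the map entries got merged into each vertex's value on load. UrlValue and its getNewId()/addRenamedInEdge() methods are hypothetical; the rest is the standard Giraph API:

    import java.io.IOException;
    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    public class RenameJoin
        extends BasicComputation<Text, UrlValue, NullWritable, LongWritable> {

      @Override
      public void compute(Vertex<Text, UrlValue, NullWritable> vertex,
                          Iterable<LongWritable> messages) throws IOException {
        if (getSuperstep() == 0) {
          // Superstep 0: push our new long id along every out-edge.
          LongWritable newId = new LongWritable(vertex.getValue().getNewId());
          for (Edge<Text, NullWritable> e : vertex.getEdges()) {
            sendMessage(e.getTargetVertexId(), newId);
          }
        } else {
          // Superstep 1: each message is one in-edge whose source is already
          // renamed; store (sourceNewId -> myNewId) for the edge output format.
          for (LongWritable src : messages) {
            vertex.getValue().addRenamedInEdge(src.get(),
                                               vertex.getValue().getNewId());
          }
          vertex.voteToHalt();
        }
      }
    }

Each edge (u -> v) then comes out exactly once, from v's side, with both endpoints renamed.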
[[Note that you can do another thing here as well, MUCH SIMPLER than the above:]]

> Superstep 0: in your compute class, keep a thread-local counter that increments for each vertex the thread processes, and assign the value [(workerId, threadId), number] to each vertex. Then aggregate {(workerId, threadId), count in group}.

> Superstep 1, master: we now see [{(workerId, threadId), count in group}], so recompute another aggregator of the form [{(workerId, threadId), cumulative sum up to this group}] and send this aggregator to the workers.

Worker: read the cumulative sum from the aggregator and add it to each vertex's current value.

When you output the graph this time as edge output, sourceId and targetId are set to the vertex values, i.e., the assigned numbers.
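As a toy stand-alone simulation of that flow (invented names, Giraph aggregator wiring omitted):

    import java.util.*;

    public final class WorkerThreadNumbering {
      public static void main(String[] args) {
        // Superstep 0 result: vertices numbered locally per (worker, thread).
        Map<String, List<String>> groups = new TreeMap<>();
        groups.put("w0:t0", Arrays.asList("urlA", "urlB"));
        groups.put("w0:t1", Arrays.asList("urlC"));
        groups.put("w1:t0", Arrays.asList("urlD", "urlE"));

        // Master: cumulative sum of group sizes = base offset per group.
        Map<String, Long> base = new TreeMap<>();
        long running = 0;
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
          base.put(g.getKey(), running);
          running += g.getValue().size();
        }

        // Superstep 1: global id = group base + local number. Unique, no holes.
        for (Map.Entry<String, List<String>> g : groups.entrySet()) {
          long local = 0;
          for (String url : g.getValue()) {
            System.out.println(url + " -> " + (base.get(g.getKey()) + local++));
          }
        }
      }
    }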
Date: Tue, 15 Apr 2014 23:40:39 +0200
Subject: Re: Changing index of a graph
From: [email protected]
To: [email protected]
I have a pipeline that creates a graph and then does some transformations on it
(with Giraph). In the end I want to dump it into Neo4j to allow for Cypher
queries.
I was told that I could make the batch import into Neo4j a lot faster if I used
long identifiers without holes, thereby matching their internal ID space. If I
understand it right, they use the IDs as offsets to build an on-disk index,
which is why there should be no holes.
I didn't expect it to be so costly to change the index, but I guess this way I
could at least spread the load across the cluster, since the batch import
happens on a single machine.
Thanks for the input, I will see what makes the most sense with the limited
time I have.
On Tue, Apr 15, 2014 at 5:31 PM, Lukas Nalezenec
<[email protected]> wrote:
Hi,
I did the same thing in two M/R jobs during preprocessing - it was
pretty powerful for web graphs but a little bit slow.
The solution for Giraph is:
1. Implement your own Partition which iterates vertices in order.
Use an appropriate partitioner.
2. During the first iteration, rename the vertices in each
partition so there are no holes; holes will remain only between
partitions. At the end, get the min and max vertex index of each
partition, send them to the master in an aggregator, and compute
the mapping required to delete the holes (see the sketch below).
3. During the second iteration, iterate over all vertices and
delete the holes by shifting the vertex indexes.
4. .... rename edges (two more iterations)...
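For step 2, the master-side mapping could look something like this sketch
(names invented; each partition reports the min and max of its locally
contiguous range via the aggregator, and step 3 applies newId = oldId + shift):

    import java.util.*;

    public final class HoleShifts {
      // minMaxPerPartition: partitionId -> {min, max} after local renaming.
      public static Map<Integer, Long> compute(
              SortedMap<Integer, long[]> minMaxPerPartition) {
        Map<Integer, Long> shift = new TreeMap<>();
        long nextStart = 0;  // first id the current partition should occupy
        for (Map.Entry<Integer, long[]> e : minMaxPerPartition.entrySet()) {
          long min = e.getValue()[0], max = e.getValue()[1];
          shift.put(e.getKey(), nextStart - min);  // slides [min, max] down
          nextStart += max - min + 1;              // contiguous inside a partition
        }
        return shift;
      }
    }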
Btw: why do you need such indexes? For HLL?
Lukas
On 15.4.2014 15:33, Martin Neumann wrote:
Hey,
I have a huge edge list (several billion edges) where the node
IDs are URLs.
The algorithm I want to run needs the IDs to be longs, and
there should be no holes in the ID space (so I can't simply
hash the URLs).
Is anyone aware of a simple solution that does not require
an impractically huge hash map?
My current idea is to load the graph into another Giraph
job and then assign a number to each node. This way the
mapping of number to URL would be stored in the node.
The problem is that I have to assign the numbers
sequentially to ensure there are no holes and the numbers
are unique. No idea if this is even possible in Giraph.
Any input is welcome
cheers Martin