Hi,
I did the same thing in two M/R jobs during preprocessing - it was
pretty powerful for web graphs, but a little bit slow.
Solution for Giraph is:
1. Implement your own partition that iterates over vertices in order,
and use an appropriate partitioner.
2. During the first iteration, rename the vertices in each partition
so there are no holes inside a partition. Holes will remain only
between partitions.
At the end, get the min and max vertex index for each partition, send
them to the master through an aggregator, and compute the mapping
required to remove the holes.
3. During the second iteration, go over all vertices and remove the
holes by shifting the vertex indexes.
4. .... rename edges (two more iterations)...
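
The steps above can be sketched outside Giraph as a small, single-process
Python simulation. This is only an illustration of the index arithmetic,
not Giraph code: the names STRIDE, assign_provisional_ids, compute_shifts,
close_holes and rename_edges are hypothetical, partitions are modelled as
plain lists of URLs, and the edge renaming is a dictionary lookup here,
whereas in Giraph it would need message passing between vertices.

```python
# Hypothetical size of the ID range reserved for each partition.
# It only has to be larger than the biggest partition.
STRIDE = 1_000_000


def assign_provisional_ids(partitions):
    """Step 2: number vertices densely inside each partition.

    Returns the provisional ID of every URL and the (min, max) index
    range of every partition, which is what would go to the master.
    """
    provisional = {}
    ranges = []
    for p, urls in enumerate(partitions):
        base = p * STRIDE
        for i, url in enumerate(urls):
            provisional[url] = base + i
        # An empty partition yields max < min, i.e. a range of size 0.
        ranges.append((base, base + len(urls) - 1))
    return provisional, ranges


def compute_shifts(ranges):
    """Master side: how far each partition must move down to close holes."""
    shifts = []
    next_free = 0  # first unused ID in the final, hole-free ID space
    for lo, hi in ranges:
        shifts.append(lo - next_free)
        next_free += hi - lo + 1
    return shifts


def close_holes(provisional, shifts):
    """Step 3: shift every vertex index by its partition's shift."""
    return {url: pid - shifts[pid // STRIDE]
            for url, pid in provisional.items()}


def rename_edges(edges, final_ids):
    """Step 4: rewrite both edge endpoints with the final IDs."""
    return [(final_ids[src], final_ids[dst]) for src, dst in edges]


# Toy input: three partitions of URLs and a few edges between them.
partitions = [["a.com", "b.com"], ["c.com"], ["d.com", "e.com", "f.com"]]
edges = [("a.com", "c.com"), ("c.com", "f.com"), ("e.com", "b.com")]

provisional, ranges = assign_provisional_ids(partitions)
shifts = compute_shifts(ranges)
final_ids = close_holes(provisional, shifts)
renamed = rename_edges(edges, final_ids)

print(final_ids)
print(renamed)
```

The master only needs one (min, max) pair per partition, so the data it
handles grows with the number of partitions, not with the number of
vertices.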
Btw: why do you need such indexes? For HLL?
Lukas
On 15.4.2014 15:33, Martin Neumann wrote:
Hej,
I have a huge edge list (several billion edges) where the node IDs are
URLs. The algorithm I want to run needs the IDs to be longs, and there
should be no holes in the ID space (so I can't simply hash the URLs).
Is anyone aware of a simple solution that does not require an
impractically large hash map?
My current idea is to load the graph into another Giraph job and
then assign a number to each node. This way the mapping from number
to URL would be stored in the node.
The problem is that I have to assign the numbers sequentially to
ensure that there are no holes and that the numbers are unique. I have
no idea if this is even possible in Giraph.
Any input is welcome
cheers Martin