question about vertex instantiation location. . .

2012-02-10 Thread David Garcia
Hey guys. . .I have a question about dynamic vertex instantiation via
the sendMsg(. . .) method.  I have a job that starts processing on a
sequenceFile with only two vertices in it.  Each vertex has information in
its value that tells it which vertices are adjacent to it.  The primary
reason I'm doing this is to avoid loading the entire graph into the job.
There are many vertices that won't do any processing (no need to load
them).  I would like to take my two vertices and dynamically build the
graph by sending messages.  So far, my experimentation shows that this is
promising. . .but I have a question WRT load balancing for new vertex
instantiation.  When I call sendMsg(newVertexID), where will the vertex be
instantiated?  If I specify 20 mappers (but with only two vertices in my
sequence file), obviously there is going to be at least one mapper without
a vertex.  Is it possible that the vertex created by sendMsg(newVertexID)
will be instantiated on an empty mapper?  I would like this. . .for load
balancing purposes.

-david



Re: question about vertex instantiation location. . .

2012-02-10 Thread David Garcia
Awesome.  Thanks so much for the info.  I'll let y'all know how my testing
goes.

On 2/10/12 4:04 PM, Avery Ching ach...@apache.org wrote:

Even if you start with two vertices, the number of partitions is the
number of workers squared times a multiplier (see
HashMasterPartitioner#PARTITION_COUNT_MULTIPLIER).  By default, the
multiplier is 1, so if you have, say, 10 workers, you'll have 100
partitions.  The number of partitions is capped, though, at about 2995
because of the maximum znode size in ZooKeeper.  So everything should be
fine for you.
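
As a rough illustration of the arithmetic above, here is a minimal sketch.
It is not the Giraph source: the class name, method name and the
MAX_PARTITIONS constant are hypothetical, and the cap simply reuses the
"about 2995" figure from this message.

```java
// Sketch only: mirrors the rule "workers squared times a multiplier, capped".
// All names here are hypothetical, not Giraph's API.
final class PartitionCountSketch {
    // Assumed cap, taken from the "about 2995" figure quoted in the thread.
    static final int MAX_PARTITIONS = 2995;

    static int partitionCount(int workers, float multiplier) {
        // workers^2 * multiplier, computed in floating point then truncated.
        long wanted = (long) (multiplier * workers * workers);
        // Keep at least one partition and never exceed the cap.
        return (int) Math.max(1, Math.min(wanted, MAX_PARTITIONS));
    }

    public static void main(String[] args) {
        System.out.println(partitionCount(10, 1.0f));   // 100
        System.out.println(partitionCount(20, 1.0f));   // 400
        System.out.println(partitionCount(100, 1.0f));  // 2995 (capped)
    }
}
```

With 20 workers and the default multiplier this gives 400 partitions, far
more than the two vertices the job starts with.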

Avery

On 2/10/12 1:52 PM, David Garcia wrote:
 Ah, so, I think I would like to balance by vertices.  My main question
 is that my graph starts with two vertices. . .I would like to specify
 more than two mappers.  My job will end up creating around 100,000
 vertices.  I would like to make sure that these extra vertices will be
 evenly distributed across all mappers (including the ones that don't
 have the initial two vertices).  Does this make sense?  Does Giraph
 support this out of the box, or do I need to add something?  Thx.

 -David


 On 2/10/12 3:41 PM, Avery Ching ach...@apache.org wrote:

 By default, you are using the HashPartitionerFactory.  This will create
 the partitions ahead of time and balance them equally by count across
 the workers.  Therefore, assuming you have a uniform distribution across
 the VertexId space, the graph should be balanced across the workers
 evenly according to the number of vertices.  If you look at
 PartitionBalancer, you can try to rebalance the graph as it is running
 if you like.  This is a bit experimental, but should work.  The choices
 for balancing are: no balancing, balance by edges, or balance by
 vertices.
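
 To make the effect of hash partitioning concrete, here is a small
 simulation.  It is only a sketch under stated assumptions: the class and
 method names are hypothetical, vertex IDs are taken to be the longs
 0..99,999, and partitions are assumed to be dealt out to workers
 round-robin.  Giraph's real assignment logic may differ in detail.

```java
import java.util.Arrays;

// Sketch only: hypothetical names, not Giraph's API.
final class HashPlacementSketch {
    // Map a vertex ID to a partition by hashing (assumed scheme).
    static int partitionFor(long vertexId, int partitionCount) {
        return Math.floorMod(Long.hashCode(vertexId), partitionCount);
    }

    // Assumption: partitions are spread over workers round-robin.
    static int workerFor(int partition, int workers) {
        return partition % workers;
    }

    // Count how many of the IDs 0..vertices-1 land on each worker.
    static int[] verticesPerWorker(int vertices, int workers, int partitionCount) {
        int[] counts = new int[workers];
        for (long id = 0; id < vertices; id++) {
            counts[workerFor(partitionFor(id, partitionCount), workers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 20 workers, 400 partitions (20 squared, multiplier 1).
        int[] counts = verticesPerWorker(100_000, 20, 400);
        System.out.println(Arrays.toString(counts));
    }
}
```

 In this toy setup every worker receives vertices, including workers that
 held none of the initial input, because placement depends only on the
 hash of the vertex ID and not on where the sending vertex lives.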

 Hope that helps,

 Avery

