Thanks for letting us know what the issue was. Glad you figured out the /etc/hosts problem; I didn't think of that one.

Avery

On 12/2/11 11:47 PM, Inci Cetindil wrote:
Hi Avery,

I finally succeeded in running the benchmark. The problem was not the port but the IP resolution.

After removing the mappings from 127.0.0.1 to the node names in the /etc/hosts files, it worked like a charm! I guess Hadoop has a different code path for determining which IP it should listen on, which is why normal Hadoop jobs worked with the previous network configuration.
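
For anyone who hits the same thing, the entry I removed was of this form (the hostname and address are taken from the logs below; the exact file contents are an approximation):

127.0.0.1     localhost rainbow-01

With that loopback alias gone, the node name resolves to the machine's real interface address instead:

192.168.100.1     rainbow-01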

Thanks for your help!
Inci

On Dec 2, 2011, at 11:06 AM, Avery Ching wrote:

You can actually set the starting RPC port to change it from 30000 by adding the appropriate configuration, i.e.

hadoop jar giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.rpcInitialPort=<your starting port> -e 1 -s 3 -v -V 500 -w 5
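
For example, to start the RPC ports at 40000 (an arbitrary value chosen for illustration):

hadoop jar giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.rpcInitialPort=40000 -e 1 -s 3 -v -V 500 -w 5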

I would ensure that those ports are open for communication from one node in your cluster to another. I don't think anyone else has run into this problem yet...

Since the job does take some time to fail, you might want to start it up and then try to telnet to its RPC port from another machine in the cluster and see if that succeeds.
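
For example, from another machine in the cluster (host and port taken from the retry messages in your logs):

telnet rainbow-01 30004

If that also reports "Connection refused", the server never managed to bind to the port; if it connects, the problem is somewhere else.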

Hope that helps,

Avery

On 12/1/11 11:04 PM, Inci Cetindil wrote:
I have tried it with various numbers of workers and it only worked with 1 
worker.

I am not running multiple Giraph jobs at the same time. Does it always use ports 30000 and up? I checked the ports in use with the "netstat" command and didn't see any of the ports 30000-30005.
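
The check I ran was roughly the following (the flags assume the Linux netstat; other platforms may differ):

netstat -tln | grep ':300'

Here -t restricts the output to TCP sockets, -l to listening ones, and -n keeps addresses and ports numeric.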

Inci

On Dec 1, 2011, at 7:03 PM, Avery Ching wrote:

Hmmm...this is unusual. I wonder if it is tied to the weird number of tasks you are getting. Can you try it with various numbers of workers (i.e. 1, 2) and see if it works?

To me, the connection refused error indicates that perhaps the server failed to bind to its port (are you running multiple Giraph jobs at the same time?) or that the server died.

Avery

On 12/1/11 5:33 PM, Inci Cetindil wrote:
I am sure the machines can communicate with each other and the ports are not blocked. I can run a word-count Hadoop job without any problem on these machines. My Hadoop version is 0.20.203.0.

Thanks,
Inci

On Dec 1, 2011, at 3:57 PM, Avery Ching wrote:

Thanks for the logs.  I see a lot of issues like the following:

2011-12-01 00:04:46,241 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 0 time(s).
2011-12-01 00:04:47,243 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 1 time(s).
2011-12-01 00:04:48,245 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 2 time(s).
2011-12-01 00:04:49,247 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 3 time(s).
2011-12-01 00:04:50,249 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 4 time(s).
2011-12-01 00:04:51,251 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 5 time(s).
2011-12-01 00:04:52,253 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 6 time(s).
2011-12-01 00:04:53,255 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 7 time(s).
2011-12-01 00:04:54,256 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 8 time(s).
2011-12-01 00:04:55,258 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: rainbow-01/192.168.100.1:30004. Already tried 9 time(s).
2011-12-01 00:04:55,261 WARN org.apache.giraph.comm.BasicRPCCommunications: 
connectAllRPCProxys:     Failed on attempt 0 of 5 to connect to 
(id=0,cur=Worker(hostname=rainbow-01, MRpartition=4, 
port=30004),prev=null,ckpt_file=null)
java.net.ConnectException: Call to rainbow-01/192.168.100.1:30004 failed on 
connection exception: java.net.ConnectException: Connection refused

Are you sure that your machines can communicate with each other? Are the ports 30000 and up blocked? And you're right, you should have had only 6 tasks. What version of Hadoop is this on?
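
If you want to rule out a firewall, something like this on each node should show whether there are rules that could drop traffic to those ports (assuming iptables is what's in use):

sudo iptables -L INPUT -n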

Avery

On 12/1/11 2:43 PM, Inci Cetindil wrote:
Hi Avery,

I attached the logs for the first attempts. The weird thing is that even though I specified the number of workers as 5, I had 8 mapper tasks. You can see in the logs that tasks 6 and 7 failed immediately. Do you have any explanation for this behavior? Normally I should have only 6 tasks (5 workers plus 1 master), right?

Thanks,
Inci

On Dec 1, 2011, at 11:00 AM, Avery Ching wrote:

Hi Inci,

I am not sure what's wrong. I ran the exact same command on a freshly checked-out version of Giraph without any trouble. Here's my output:

hadoop jar target/giraph-0.70-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5
Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankVertex
11/12/01 10:58:05 WARN bsp.BspOutputFormat: checkOutputSpecs: 
ImmutableOutputCommiter will not check anything
11/12/01 10:58:05 INFO mapred.JobClient: Running job: job_201112011054_0003
11/12/01 10:58:06 INFO mapred.JobClient:  map 0% reduce 0%
11/12/01 10:58:23 INFO mapred.JobClient:  map 16% reduce 0%
11/12/01 10:58:35 INFO mapred.JobClient:  map 100% reduce 0%
11/12/01 10:58:40 INFO mapred.JobClient: Job complete: job_201112011054_0003
11/12/01 10:58:40 INFO mapred.JobClient: Counters: 31
11/12/01 10:58:40 INFO mapred.JobClient:   Job Counters
11/12/01 10:58:40 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=77566
11/12/01 10:58:40 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
11/12/01 10:58:40 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
11/12/01 10:58:40 INFO mapred.JobClient:     Launched map tasks=6
11/12/01 10:58:40 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
11/12/01 10:58:40 INFO mapred.JobClient:   Giraph Timers
11/12/01 10:58:40 INFO mapred.JobClient:     Total (milliseconds)=13468
11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 3 (milliseconds)=41
11/12/01 10:58:40 INFO mapred.JobClient:     Setup (milliseconds)=11691
11/12/01 10:58:40 INFO mapred.JobClient:     Shutdown (milliseconds)=73
11/12/01 10:58:40 INFO mapred.JobClient:     Vertex input superstep 
(milliseconds)=369
11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 0 (milliseconds)=674
11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 2 (milliseconds)=519
11/12/01 10:58:40 INFO mapred.JobClient:     Superstep 1 (milliseconds)=96
11/12/01 10:58:40 INFO mapred.JobClient:   Giraph Stats
11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate edges=500
11/12/01 10:58:40 INFO mapred.JobClient:     Superstep=4
11/12/01 10:58:40 INFO mapred.JobClient:     Last checkpointed superstep=2
11/12/01 10:58:40 INFO mapred.JobClient:     Current workers=5
11/12/01 10:58:40 INFO mapred.JobClient:     Current master task partition=0
11/12/01 10:58:40 INFO mapred.JobClient:     Sent messages=0
11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate finished vertices=500
11/12/01 10:58:40 INFO mapred.JobClient:     Aggregate vertices=500
11/12/01 10:58:40 INFO mapred.JobClient:   File Output Format Counters
11/12/01 10:58:40 INFO mapred.JobClient:     Bytes Written=0
11/12/01 10:58:40 INFO mapred.JobClient:   FileSystemCounters
11/12/01 10:58:40 INFO mapred.JobClient:     FILE_BYTES_READ=590
11/12/01 10:58:40 INFO mapred.JobClient:     HDFS_BYTES_READ=264
11/12/01 10:58:40 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=129240
11/12/01 10:58:40 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=55080
11/12/01 10:58:40 INFO mapred.JobClient:   File Input Format Counters
11/12/01 10:58:40 INFO mapred.JobClient:     Bytes Read=0
11/12/01 10:58:40 INFO mapred.JobClient:   Map-Reduce Framework
11/12/01 10:58:40 INFO mapred.JobClient:     Map input records=6
11/12/01 10:58:40 INFO mapred.JobClient:     Spilled Records=0
11/12/01 10:58:40 INFO mapred.JobClient:     Map output records=0
11/12/01 10:58:40 INFO mapred.JobClient:     SPLIT_RAW_BYTES=264


Would it be possible to send me the logs from the first attempts for every map 
task?

i.e. from
Task attempt_201111302343_0002_m_000000_0
Task attempt_201111302343_0002_m_000001_0
Task attempt_201111302343_0002_m_000002_0
Task attempt_201111302343_0002_m_000003_0
Task attempt_201111302343_0002_m_000004_0
Task attempt_201111302343_0002_m_000005_0

I think that could help us find the issue.

Thanks,

Avery

On 12/1/11 1:17 AM, Inci Cetindil wrote:
Hi,

I'm running the PageRank benchmark example on a cluster with 1 master + 5 slave nodes. I first tried it with a large number of vertices; when that failed, I decided to make it run with 500 vertices and 5 workers. However, it doesn't work even for 500 vertices.
I am using the latest version of Giraph from the trunk and running the 
following command:

hadoop jar giraph-0.70-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 500 -w 5

I attached the error message that I am receiving. Please let me know if I am 
missing something.

Best regards,
Inci
