Re: Benchmark runs but tests fail

2011-11-27 Thread Avery Ching

Inlining the response.  Sorry for the delay, I have been out for much of today.

Avery

On 11/26/11 9:32 PM, Oana Theogarajan wrote:

Hi Avery,
thanks for the quick response.
About the unittests:

I was indeed specifying the wrong host:port
 1) the LocalJobRunner test (mvn test) works
 2) The tests against the actual Hadoop instance (mvn test 
-Dprop.mapred.job.tracker=hdfs://ip-10-202-59-170.ec2.internal:50002) 
fail - they do execute, they assign maps etc., but the tests fail. 
The output is attached in the Testlogs.txt file. I am also attaching 
the job logs in case there is more info there that might be helpful to 
you.


About the PageRankBenchmark - I run the following command:
hadoop jar giraph-0.70-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 10 -v -V 100 
-w 8


It works fine - attached is the master task log for the successful 
case (mastersuccess.txt).
Then I test it by killing a tasktracker (I make sure it's not the one 
that runs the master-zookeeper task, and I also make sure I'm past 
superstep 2 so I can have a valid checkpoint). I am attaching the 
master task log as masterFailed.txt


Looks like the master is trying to start again from the last 
checkpoint, but it's waiting to have 8 running map tasks, which 
doesn't happen after I killed 2 of them.
("This occurs if you do not have enough map tasks available 
simultaneously on your Hadoop instance to fulfill the number of 
requested workers.")
I was thinking the master would start more maps if it finds that some 
died. It looks like the master kills itself?


The way that Giraph works is by waiting until some minimum number of 
workers are available.  If that minimum is not met or some percent of 
the workers do not respond in time, them master will die and the job 
will fail.  So if you only have 8 map slots on the whole Hadoop instance 
and you permanently remove some, but the job is waiting for 8 maps to 
simultaneously be running, the job will fail.  Since everything is in 
memory, the user is expected to choose a reasonable minimum and maximum 
number of workers that make sense for their application.
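
The waiting behavior described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Giraph's actual code: the class WorkerWaitSketch, the method waitForWorkers, and all of its parameters are hypothetical names, and the sketch covers only the minimum-worker check, not the "percent of workers responded" check.

```java
import java.util.function.IntSupplier;

public class WorkerWaitSketch {
    /**
     * Polls until at least minWorkers have responded or the attempts run out.
     * Hypothetical helper for illustration; not taken from the Giraph source.
     */
    public static boolean waitForWorkers(IntSupplier respondedWorkers,
                                         int minWorkers,
                                         int maxAttempts,
                                         long sleepMsecs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            int found = respondedWorkers.getAsInt();
            if (found >= minWorkers) {
                return true; // enough workers, the superstep can start
            }
            System.out.printf(
                "Only found %d responses of %d needed, used %d of %d attempts%n",
                found, minWorkers, attempt, maxAttempts);
            try {
                Thread.sleep(sleepMsecs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // the minimum was never met, so the job would fail
    }

    public static void main(String[] args) {
        // Only 6 of the 8 requested workers ever respond, as in the failed run.
        boolean ok = waitForWorkers(() -> 6, 8, 10, 1L);
        System.out.println(ok ? "superstep can start" : "master gives up, job fails");
    }
}
```

With a fixed worker count (minimum equal to maximum), losing map slots permanently means the condition can never be satisfied, which matches the failure seen here.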


and then a bunch of other maps get started trying to recover- in the 
end 32 tasks get launched (4 attempts for each one I'm assuming - 4 is 
the default map.max.attempts; I didn't change it). The number of 
simultaneously running maps is always less than 8 though - not sure 
why, but the job would need to have 8 running simultaneously in order 
to recover. Does it have anything to do with the fact that the number 
of workers is fixed - from the source code looks like the 
PageRankBenchmark effectively sets the minWorker and maxWorker to the 
number specified at the command line? I'm just making uneducated 
guesses at this point.


You are right =), good guess.  There is one unit test that checks 
whether the automatic checkpoint restart works (only when run against a 
real Hadoop instance).  See 
src/test/java/org/apache/giraph/TestAutoCheckpoint.java.  It fails a 
worker and then recovers from a previous checkpoint.


Hopefully the logs give you some useful info. Let me know if you have 
any questions about them or you need more info. I'm hoping it's 
something relevant rather than something stupid I might be doing




The master log appears to confirm what you suspected:

2011-11-27 03:34:54,073 INFO org.apache.giraph.graph.BspServiceMaster: 
setJobState: 
{"_stateKey":"START_SUPERSTEP","_applicationAttemptKey":1,"_superstepKey":2} 
on superstep 2
2011-11-27 03:35:34,932 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 0 of 10 attempts.
2011-11-27 03:36:04,941 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 1 of 10 attempts.
2011-11-27 03:36:34,951 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 2 of 10 attempts.
2011-11-27 03:37:04,962 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 3 of 10 attempts.
2011-11-27 03:37:34,971 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 4 of 10 attempts.
2011-11-27 03:38:04,982 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 5 of 10 attempts.
2011-11-27 03:38:34,991 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  
Sleeping for 3 msecs and used 6 of 10 attempts.
2011-11-27 03:39:05,002 INFO org.apache.giraph.graph.BspServiceMaster: 
checkWorkers: Only 
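
When reading a long master log like the one above, it can help to pull the numbers out of each checkWorkers message. The following is only a sketch: the class and method names are hypothetical, the regular expression is derived solely from the log lines quoted in this thread, and it assumes each wrapped message has first been joined into a single line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CheckWorkersLogSketch {
    // Pattern inferred from the quoted log lines, not from the Giraph source.
    private static final Pattern LINE = Pattern.compile(
        "Only found (\\d+) responses of (\\d+) needed to start superstep (-?\\d+)\\."
        + "\\s+Sleeping for (\\d+) msecs and used (\\d+) of (\\d+) attempts");

    /**
     * Returns {found, needed, superstep, usedAttempts, maxAttempts},
     * or null when the line is not a checkWorkers message.
     */
    public static int[] parse(String logLine) {
        Matcher m = LINE.matcher(logLine);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(5)),
            Integer.parseInt(m.group(6))
        };
    }

    public static void main(String[] args) {
        String line = "2011-11-27 03:35:34,932 INFO org.apache.giraph.graph.BspServiceMaster: "
            + "checkWorkers: Only found 6 responses of 8 needed to start superstep 2.  "
            + "Sleeping for 3 msecs and used 0 of 10 attempts.";
        int[] r = parse(line);
        System.out.println("missing workers: " + (r[1] - r[0]));
    }
}
```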

Re: Benchmark runs but tests fail

2011-11-27 Thread Oana Theogarajan

 Hi Hyunsik,
My Hadoop version is 0.20.203
I managed to get the tests running (I was specifying the wrong port for 
my hadoop instance). However, the tests do fail with the following message:


---
 T E S T S
---
Running org.apache.giraph.TestManualCheckpoint
Setting tasks to 3 for testBspCheckpoint since JobTracker exists...
setup: Sending job to job tracker 
hdfs://ip-10-202-59-170.ec2.internal:50002 with jar path 
target/giraph-0.70-jar-with-dependencies.jar for testBspCheckpoint
11/11/26 23:50:34 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.

11/11/26 23:50:35 INFO mapred.JobClient: Running job: job_20240142_0036
11/11/26 23:50:36 INFO mapred.JobClient:  map 0% reduce 0%
11/11/26 23:50:52 INFO mapred.JobClient:  map 25% reduce 0%
11/11/26 23:51:21 INFO mapred.JobClient:  map 0% reduce 0%
11/11/26 23:51:26 INFO mapred.JobClient: Task Id : 
attempt_20240142_0036_m_00_0, Status : FAILED

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

Thanks
Oana


On 11/27/11 5:15 PM, Hyunsik Choi wrote:

Hi Oana,

I have a question. Which Hadoop version is your local cluster running?
The errors below usually occur when the RPC versions do not match.

java.io.IOException: Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException
   at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
   at org.apache.hadoop.ipc.Client.call(Client.java:1033)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
   at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
...

Thank you for reporting
--
Hyunsik Choi

On Sun, Nov 27, 2011 at 7:09 AM, Oana Theogarajan wrote:


Hi,
I've been testing Giraph on a hadoop cluster set up on Amazon EC2
and I encounter some issues. I can successfully run the
PageRankBenchmark, however if I am trying to test the fault
tolerance by killing a tasktracker the job eventually dies after
trying repeatedly. I have checkpoints enabled (the default every 2
supersteps - and I can see them written in the checkpointing
directory)
I then tried to run the unit tests using
mvn test -Dprop.mapred.job.tracker=localhost:50030
and a lot of them fail. The output is quoted below. The surefire
logs show the following error. I am pretty new to both hadoop and
Giraph and I can't tell what could cause this error. I am puzzled
since I can run Giraph PageRankBenchmark jobs but the tests fail.

Thanks in advance for your help figuring this out.
Best,
   Oana

Tests run: 9, Failures: 0, Errors: 7, Skipped: 0, Time elapsed:
0.5 sec <<< FAILURE!
testBspFail(org.apache.giraph.TestBspBasic)  Time elapsed: 0.054
sec <<< ERROR!
java.io.IOException: Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException
   at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
   at org.apache.hadoop.ipc.Client.call(Client.java:1033)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
   at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:364)
   at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:460)
   at org.apache.hadoop.mapred.JobClient.init(JobClient.java:454)
   at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:437)
   at org.apache.hadoop.mapreduce.Job$1.run(Job.java:477)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:416)
   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
   at org.apache.hadoop.mapreduce.Job.connect(Job.java:475)
   at org.apache.hadoop.mapreduce.Job.submit(Job.java:464)
   at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
   at org.apache.giraph.graph.GiraphJob.run(GiraphJob.java:524)
   at org.apache.giraph.TestBspBasic.testBspFail(TestBspBasic.java:180)
Caused by: java.io.EOFException
   at java.io.DataInputStream.readInt(DataInputStream.java:392)
   at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:774)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)


---
 T E S T S
---
Running org.apache.giraph.TestManualCheckpoint
Setting tasks to 3 for testBspCheckpoint since JobTracker exists...
setup: Sendi

Re: Benchmark runs but tests fail

2011-11-27 Thread Hyunsik Choi
Hi Oana,

I have a question. Which Hadoop version is your local cluster running?
The errors below usually occur when the RPC versions do not match.

java.io.IOException: Call to localhost/127.0.0.1:50030 failed on local
exception: java.io.EOFException
   at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
   at org.apache.hadoop.ipc.Client.call(Client.java:1033)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
   at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
...

Thank you for reporting
--
Hyunsik Choi

On Sun, Nov 27, 2011 at 7:09 AM, Oana Theogarajan wrote:

> Hi,
> I've been testing Giraph on a hadoop cluster set up on Amazon EC2 and I
> encounter some issues. I can successfully run the PageRankBenchmark,
> however if I am trying to test the fault tolerance by killing a tasktracker
> the job eventually dies after trying repeatedly. I have checkpoints enabled
> (the default every 2 supersteps - and I can see them written in the
> checkpointing directory)
> I then tried to run the unit tests using
> mvn test -Dprop.mapred.job.tracker=localhost:50030
> and a lot of them fail. The output is quoted below. The surefire logs show
> the following error. I am pretty new to both hadoop and Giraph and I can't
> tell what could cause this error. I am puzzled since I can run Giraph
> PageRankBenchmark jobs but the tests fail.
>
> Thanks in advance for your help figuring this out.
> Best,
> Oana
>
> Tests run: 9, Failures: 0, Errors: 7, Skipped: 0, Time elapsed: 0.5 sec
> <<< FAILURE!
> testBspFail(org.apache.giraph.TestBspBasic)  Time elapsed: 0.054 sec
> <<< ERROR!
> java.io.IOException: Call to localhost/127.0.0.1:50030 failed on local
> exception: java.io.EOFException
>    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
>    at org.apache.hadoop.ipc.Client.call(Client.java:1033)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
>    at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:364)
>    at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:460)
>    at org.apache.hadoop.mapred.JobClient.init(JobClient.java:454)
>    at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:437)
>    at org.apache.hadoop.mapreduce.Job$1.run(Job.java:477)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:416)
>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>    at org.apache.hadoop.mapreduce.Job.connect(Job.java:475)
>    at org.apache.hadoop.mapreduce.Job.submit(Job.java:464)
>    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
>    at org.apache.giraph.graph.GiraphJob.run(GiraphJob.java:524)
>    at org.apache.giraph.TestBspBasic.testBspFail(TestBspBasic.java:180)
> Caused by: java.io.EOFException
>    at java.io.DataInputStream.readInt(DataInputStream.java:392)
>    at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:774)
>    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)
>
>
> ---
>  T E S T S
> ---
> Running org.apache.giraph.TestManualCheckpoint
> Setting tasks to 3 for testBspCheckpoint since JobTracker exists...
> setup: Sending job to job tracker localhost:50030 with jar path
> target/giraph-0.70-jar-with-dependencies.jar for testBspCheckpoint
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.664 sec
> <<< FAILURE!
> Running org.apache.giraph.TestAutoCheckpoint
> Setting tasks to 3 for testSingleFault since JobTracker exists...
> setup: Sending job to job tracker localhost:50030 with jar path
> target/giraph-0.70-jar-with-dependencies.jar for testSingleFault
> Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.069 sec
> <<< FAILURE!
> Running org.apache.giraph.TestBspBasic
> Setting tasks to 3 for testInstantiateVertex since JobTracker exists...
> testInstantiateVertex: java.class.path=/home/ubuntu/
> giraph/trunk/target/test-classes:/home/ubuntu/giraph/
> trunk/target/classes:/home/ubuntu/.m2/repository/junit/
> junit/3.8.1/junit-3.8.1.jar:/home/ubuntu/.m2/repository/
> org/apache/hadoop/hadoop-core/0.20.203.0/hadoop-core-0.20.
> 203.0.jar:/home/ubuntu/.m2/repository/xmlenc/xmlenc/0.52/
> xmlenc-0.52.jar:/home/ubuntu/.m2/repository/commons-
> httpclient/commons-httpclient/3.0.1/commons-httpclient-3.0.
> 1.jar:/home/ubuntu/.m2/repository/commons-logging/
> commons-logging/1.0.3/commons-logging-1.0.3.jar:/home/
> ubuntu/.m2/repository/commons-codec/commons-codec/1.4/
> commons-codec-1.4.jar:/home/ubuntu/.m2/repository/org/
> apache/commons/commons-math/2.1/commons-math-2.1.jar:/home/
> ubuntu/.m2/repository/commons-

Re: Benchmark runs but tests fail

2011-11-26 Thread Avery Ching

Hi Oana,

Thanks for your questions.  The fault tolerance should work if there is 
a viable checkpoint and there is a master and ZooKeeper process 
available to coordinate the application.  The only case I can think of 
where the fault tolerance won't work is when the allowed number of task 
failures is exceeded (Hadoop configurable variable - map.max.attempts).  Can you 
show me the log of the master task?  It would be really helpful.


As for the failing unit tests, do you actually have a Hadoop instance 
running at localhost:50030?  The unit tests can be run in two different ways:


- Against an actual Hadoop instance (i.e., mvn test 
-Dprop.mapred.job.tracker=<host>:<port>)


- Using something called LocalJobRunner that simulates a Hadoop instance 
with a single map task at a time (i.e., mvn test).
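
To illustrate the two modes, here is a rough sketch of how a test setup could choose between them based on the property from the mvn command line. It is an assumption-laden illustration, not the actual Giraph test code: the class TestModeSketch and its methods are hypothetical, and how the -D value actually reaches the test JVM depends on the build configuration.

```java
import java.util.Properties;

public class TestModeSketch {
    // Property name as it appears in the mvn commands in this thread.
    static final String JOB_TRACKER_PROP = "prop.mapred.job.tracker";

    /** True when a non-empty job tracker address was supplied. */
    public static boolean usesRealCluster(Properties props) {
        String value = props.getProperty(JOB_TRACKER_PROP);
        return value != null && !value.trim().isEmpty();
    }

    /** Describes which mode a test run would use (hypothetical helper). */
    public static String describe(Properties props) {
        if (usesRealCluster(props)) {
            return "submit to job tracker " + props.getProperty(JOB_TRACKER_PROP).trim();
        }
        return "run with LocalJobRunner";
    }

    public static void main(String[] args) {
        // With no property set, this reports the LocalJobRunner mode.
        System.out.println(describe(System.getProperties()));
    }
}
```

Note that the address must be the job tracker's RPC host:port; pointing it at a different port on the same machine can produce connection errors like the EOFException quoted below.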


Hope that helps, let me know if you have other questions.

Avery

On 11/26/11 3:09 PM, Oana Theogarajan wrote:

Hi,
I've been testing Giraph on a hadoop cluster set up on Amazon EC2 and I 
encounter some issues. I can successfully run the PageRankBenchmark, 
however if I am trying to test the fault tolerance by killing a 
tasktracker the job eventually dies after trying repeatedly. I have 
checkpoints enabled (the default every 2 supersteps - and I can see 
them written in the checkpointing directory)

I then tried to run the unit tests using
mvn test -Dprop.mapred.job.tracker=localhost:50030
and a lot of them fail. The output is quoted below. The surefire logs 
show the following error. I am pretty new to both hadoop and Giraph 
and I can't tell what could cause this error. I am puzzled since I can 
run Giraph PageRankBenchmark jobs but the tests fail.


Thanks in advance for your help figuring this out.
Best,
Oana

Tests run: 9, Failures: 0, Errors: 7, Skipped: 0, Time elapsed: 0.5 
sec <<< FAILURE!
testBspFail(org.apache.giraph.TestBspBasic)  Time elapsed: 0.054 sec 
<<< ERROR!
java.io.IOException: Call to localhost/127.0.0.1:50030 failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1065)
at org.apache.hadoop.ipc.Client.call(Client.java:1033)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
at org.apache.hadoop.mapred.$Proxy2.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:364)
at org.apache.hadoop.mapred.JobClient.createRPCProxy(JobClient.java:460)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:454)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:437)
at org.apache.hadoop.mapreduce.Job$1.run(Job.java:477)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapreduce.Job.connect(Job.java:475)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:464)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
at org.apache.giraph.graph.GiraphJob.run(GiraphJob.java:524)
at org.apache.giraph.TestBspBasic.testBspFail(TestBspBasic.java:180)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:774)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:712)


---
 T E S T S
---
Running org.apache.giraph.TestManualCheckpoint
Setting tasks to 3 for testBspCheckpoint since JobTracker exists...
setup: Sending job to job tracker localhost:50030 with jar path 
target/giraph-0.70-jar-with-dependencies.jar for testBspCheckpoint
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.664 
sec <<< FAILURE!

Running org.apache.giraph.TestAutoCheckpoint
Setting tasks to 3 for testSingleFault since JobTracker exists...
setup: Sending job to job tracker localhost:50030 with jar path 
target/giraph-0.70-jar-with-dependencies.jar for testSingleFault
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.069 
sec <<< FAILURE!

Running org.apache.giraph.TestBspBasic
Setting tasks to 3 for testInstantiateVertex since JobTracker exists...
testInstantiateVertex: 
java.class.path=/home/ubuntu/giraph/trunk/target/test-classes:/home/ubuntu/giraph/trunk/target/classes:/home/ubuntu/.m2/repository/junit/junit/3.8.1/junit-3.8.1.jar:/home/ubuntu/.m2/repository/org/apache/hadoop/hadoop-core/0.20.203.0/hadoop-core-0.20.203.0.jar:/home/ubuntu/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar:/home/ubuntu/.m2/repository/commons-httpclient/commons-httpclient/3.0.1/commons-httpclient-3.0.1.jar:/home/ubuntu/.m2/repository/commons-logging/commons-logging/1.0.3/commons-logging-1.0.3.jar:/home/ubuntu/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/home/ubuntu/.m2/repository/org/apache/commons/commons-math/2.1/commons-math-2.1.jar:/home/ubuntu/.