Hi Thomas, sorry to bother you. When I run some small graph tests on my cluster, a 25 MB graph job succeeds and I get the correct output file on HDFS, but a 50 MB one does not. When the job fails, the *ZooKeeper log ends like this:*

2012-09-17 21:04:27,866 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x239d433755a0014, likely client has closed socket
2012-09-17 21:04:32,666 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.0.2:57977 which had sessionid 0x239d433755a0014
2012-09-17 21:04:36,551 WARN org.apache.zookeeper.server.NIOServerCnxn: EndOfStreamException: Unable to read additional data from client sessionid 0x239d433755a0013, likely client has closed socket
2012-09-17 21:04:36,989 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /192.168.0.3:44924 which had sessionid 0x239d433755a0013
*The GroomServer log looks like this:*

2012-09-17 21:03:37,679 INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
2012-09-17 21:03:37,982 INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201008172027_0007_000002_0' has started.
2012-09-17 21:03:37,983 INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
2012-09-17 21:03:38,073 INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201008172027_0007_000000_0' has started.
2012-09-17 21:03:38,074 INFO org.apache.hama.bsp.GroomServer: Launch 3 tasks.
2012-09-17 21:03:38,325 INFO org.apache.hama.bsp.GroomServer: Task 'attempt_201008172027_0007_000001_0' has started.
2012-09-17 21:04:23,161 INFO org.apache.hama.bsp.GroomServer: adding purge task: attempt_201008172027_0007_000000_0
2012-09-17 21:04:23,513 INFO org.apache.hama.bsp.GroomServer: adding purge task: attempt_201008172027_0007_000002_0
2012-09-17 21:04:23,513 INFO org.apache.hama.bsp.GroomServer: About to purge task: attempt_201008172027_0007_000000_0
2012-09-17 21:04:25,918 INFO org.apache.hama.bsp.GroomServer: About to purge task: attempt_201008172027_0007_000002_0
2012-09-17 21:04:30,707 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:30,929 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:30,929 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
2012-09-17 21:04:33,965 INFO org.apache.hama.bsp.GroomServer: Kill 1 tasks.
*The task log ends like this:*

12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x00a3ef26, /192.168.0.3:34203 => 627-PC/192.168.0.5:61001] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x00a3ef26, /192.168.0.3:34203 => 627-PC/192.168.0.5:61001] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0057bd52, /192.168.0.3:53962 => 624-PC/192.168.1.2:61002] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0057bd52, /192.168.0.3:53962 => 624-PC/192.168.1.2:61002] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0104ae5e, /192.168.0.3:47749 => 625-PC/192.168.0.3:61003] INTEREST_CHANGED
12/09/17 21:04:11 INFO ipc.NettyTransceiver: [id: 0x0104ae5e, /192.168.0.3:47749 => 625-PC/192.168.0.3:61003] INTEREST_CHANGED
12/09/17 21:04:12 INFO ipc.NettyTransceiver: [id: 0x00c0499d, /192.168.0.3:36006 => 627-PC/192.168.0.5:61003] INTEREST_CHANGED
12/09/17 21:04:12 INFO ipc.NettyTransceiver: [id: 0x00c0499d, /192.168.0.3:36006 => 627-PC/192.168.0.5:61003] INTEREST_CHANGED
..........

Do you have any idea what may cause this kind of failure? Thanks a lot!

2012/9/15 Thomas Jungblut <[email protected]>

> Okay, I have observed this problem as well with a 10 GB adjacency text file.
> I was running on a 75 GB instance on EC2 with 70 GB of heap, which should be
> no problem, but it fails after several steps.
> I'm profiling it now in more detail.
>
> It can't be that 10 GB of text uses more than 20 GB of heap as a graph with
> messages.
>
> 2012/9/14 Thomas Jungblut <[email protected]>
>
>> I would trim the spaces in the key and value.
>> If it still crashes after that, I have no idea anymore and would recommend
>> you take a heap dump with hprof and look at what is using all that memory.
>>
>> 2012/9/14 庄克琛 <[email protected]>
>>
>>> Hi, I set the property in hama-site.xml:
>>> <property>
>>>   <name> hama.messenger.queue.class </name>
>>>   <value> org.apache.hama.bsp.message.DiskQueue </value>
>>> </property>
>>>
>>> Did I set it right?
>>> I restarted Hama (stop-bspd.sh and start-bspd.sh), tried the test job
>>> again, and watched memory slowly climb to 70%, 80%, 90%, then crash... >_<
>>>
>>> 2012/9/14 Thomas Jungblut <[email protected]>
>>>
>>>> Yes, I wanted to have direct memory in Hama months ago, but hadn't
>>>> managed to find enough time.
>>>> That is a very good idea.
>>>>
>>>> 2012/9/14 Tommaso Teofili <[email protected]>
>>>>
>>>>> I think we may also create an Apache DirectMemory based DiskQueue which
>>>>> caches things on disk but hides most of the complexity.
>>>>> My 2 cents,
>>>>> Tommaso
>>>>>
>>>>> 2012/9/14 Thomas Jungblut <[email protected]>
>>>>>
>>>>>> I have created an issue for that:
>>>>>> HAMA-642 <https://issues.apache.org/jira/browse/HAMA-642>
>>>>>>
>>>>>> 2012/9/14 Thomas Jungblut <[email protected]>
>>>>>>
>>>>>>> Basically I think that the graph should fit into the memory of your
>>>>>>> task, so the messages could be causing the overflow.
>>>>>>>
>>>>>>> You can try out the DiskQueue, which can be configured by setting the
>>>>>>> property "hama.messenger.queue.class" to
>>>>>>> "org.apache.hama.bsp.message.DiskQueue".
>>>>>>>
>>>>>>> This will immediately flush the messages to disk. However, this is
>>>>>>> currently experimental, so if you try it out please tell us whether
>>>>>>> it helped.
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>> To scale this further, we should write vertices that don't fit in
>>>>>>> memory to disk. I will add another JIRA for that soon.
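[For reference, a minimal hama-site.xml sketch of the DiskQueue setting discussed above, with the whitespace inside the <name> and <value> tags removed to be safe; the rest of the file is unchanged:]

```xml
<!-- hama-site.xml: spill outgoing BSP messages to disk instead of holding them on the heap -->
<property>
  <name>hama.messenger.queue.class</name>
  <value>org.apache.hama.bsp.message.DiskQueue</value>
</property>
```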
>>>>>>> 2012/9/14 庄克琛 <[email protected]>
>>>>>>>
>>>>>>>> Oh, the HDFS block size is 128 MB, not 64 MB, so the 73 MB graph will
>>>>>>>> not be split on HDFS.
>>>>>>>>
>>>>>>>> 2012/9/14 庄克琛 <[email protected]>
>>>>>>>>
>>>>>>>>> Em... I have tried your configuration advice and restarted Hama.
>>>>>>>>> I used the Google web graph
>>>>>>>>> (http://wiki.apache.org/hama/WriteHamaGraphFile),
>>>>>>>>> Nodes: 875713, Edges: 5105039, which is about 73 MB, uploaded it to
>>>>>>>>> a small HDFS cluster (block size 64 MB), and tested the PageRank
>>>>>>>>> from (http://wiki.apache.org/hama/WriteHamaGraphFile), with this
>>>>>>>>> result:
>>>>>>>>>
>>>>>>>>> ################
>>>>>>>>> function@624-PC:~/hadoop-1.0.3/hama-0.6.0$ hama jar hama-6-P* input-google ouput-google
>>>>>>>>> 12/09/14 14:27:50 INFO bsp.FileInputFormat: Total input paths to process : 1
>>>>>>>>> 12/09/14 14:27:50 INFO bsp.FileInputFormat: Total # of splits: 3
>>>>>>>>> 12/09/14 14:27:50 INFO bsp.BSPJobClient: Running job: job_201008141420_0004
>>>>>>>>> 12/09/14 14:27:53 INFO bsp.BSPJobClient: Current supersteps number: 0
>>>>>>>>> Java HotSpot(TM) Server VM warning: Attempt to allocate stack guard pages failed.
>>>>>>>>> ###################
>>>>>>>>>
>>>>>>>>> Last time the superstep count could reach 1 or 2 before the same
>>>>>>>>> result. The task attempt****.err files are empty.
>>>>>>>>> Is the graph too large?
>>>>>>>>> When I test on a small graph, I get the correct rank results.
>>>>>>>>>
>>>>>>>>> 2012/9/14 Edward J. Yoon <[email protected]>
>>>>>>>>>
>>>>>>>>>> I've added a multi-step partitioning method to save memory [1].
>>>>>>>>>> Please try configuring the property below in hama-site.xml:
>>>>>>>>>>
>>>>>>>>>> <property>
>>>>>>>>>>   <name>hama.graph.multi.step.partitioning.interval</name>
>>>>>>>>>>   <value>10000000</value>
>>>>>>>>>> </property>
>>>>>>>>>>
>>>>>>>>>> 1. https://issues.apache.org/jira/browse/HAMA-599
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 14, 2012 at 3:13 PM, 庄克琛 <[email protected]> wrote:
>>>>>>>>>>> Hi, actually I used this
>>>>>>>>>>> (https://builds.apache.org/job/Hama-Nightly/672/artifact/.repository/org/apache/hama/hama-dist/0.6.0-SNAPSHOT/)
>>>>>>>>>>> to test again; I mean I replaced everything with this
>>>>>>>>>>> 0.6.0-SNAPSHOT version and got the same out-of-memory results. I
>>>>>>>>>>> just don't know what causes the out-of-memory failures; only some
>>>>>>>>>>> small graph computations can finish. Does this version include
>>>>>>>>>>> HAMA-596 <https://issues.apache.org/jira/browse/HAMA-596>,
>>>>>>>>>>> "Optimize memory usage of graph job"?
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>> 2012/9/14 Thomas Jungblut <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> Hey, what jar did you exactly replace?
>>>>>>>>>>>> On 14.09.2012 07:49, 庄克琛 <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>> I use hama-0.5.0 with hadoop-1.0.3 to do some large-graph
>>>>>>>>>>>>> analysis.
>>>>>>>>>>>>> When I test the PageRank example, as
>>>>>>>>>>>>> (http://wiki.apache.org/hama/WriteHamaGraphFile) shows, I
>>>>>>>>>>>>> download the graph data and run the PageRank job on a small
>>>>>>>>>>>>> distributed cluster, but I only get an out-of-memory failure:
>>>>>>>>>>>>> supersteps 0, 1, and 2 work well, then the job runs out of
>>>>>>>>>>>>> memory. (Each computer has 2 GB of memory.) But when I test some
>>>>>>>>>>>>> small graphs, everything goes well.
>>>>>>>>>>>>> I also tried the trunk version
>>>>>>>>>>>>> (https://builds.apache.org/job/Hama-Nightly/672/changes#detail3),
>>>>>>>>>>>>> replacing my hama-0.5.0 with the hama-0.6.0-snapshot, and only
>>>>>>>>>>>>> got the same results.
>>>>>>>>>>>>> Anyone got better ideas?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> *Zhuang Kechen*
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Zhuang Kechen*
>>>>>>>>>>>
>>>>>>>>>>> School of Computer Science & Technology
>>>>>>>>>>> Nanjing University of Science & Technology
>>>>>>>>>>> Lab.623, School of Computer Sci. & Tech.
>>>>>>>>>>> No.200, Xiaolingwei Street
>>>>>>>>>>> Nanjing, Jiangsu, 210094
>>>>>>>>>>> P.R.
>>>>>>>>>>> China
>>>>>>>>>>>
>>>>>>>>>>> Tel: 025-84315982
>>>>>>>>>>> Email: [email protected]
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best Regards, Edward J. Yoon
>>>>>>>>>> @eddieyoon
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Zhuang Kechen*
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Zhuang Kechen*
>>>
>>> --
>>> *Zhuang Kechen*

--
*Zhuang Kechen*
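[A hedged aside on the heap-dump suggestion earlier in the thread: besides attaching hprof to the child JVM, a dump of a running process can be triggered programmatically through the HotSpot diagnostic MXBean. A minimal, Hama-independent sketch; the output file name is arbitrary:]

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    public static void main(String[] args) throws Exception {
        // Proxy to the HotSpot diagnostic MBean of the current JVM.
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // Write an hprof-format snapshot of live objects; inspect it with jhat
        // or Eclipse MAT to see which structures are consuming the heap.
        diag.dumpHeap("task-heap.hprof", true);
    }
}
```

[The same dump can be taken from outside the process with `jmap -dump:live,format=b,file=task-heap.hprof <pid>`.]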
