Hi Francis,

> First I would like to know if anyone has some documentation a bit more
> comprehensive cluster configuration hama?
On AWS EC2, you can use Apache Whirr to configure the cluster. Edward may share his procedures for maintaining his Hama cluster on Oracle BDA.

> I would also like some information about the cluster configuration HAMA as:
>
> 1) I have a cluster with 12 computers in HDFS which the optimal
> configuration of replication? configured to create 3 replicas of files,
> this is the best?

This depends on your availability requirements and the capacity of your cluster. A replication factor of 3 is a good choice if you cannot tolerate data loss. Since every block is stored three times, you have to work out whether it fits from the size of your data and the storage capacity of your cluster.

> 2) In my hama-site.xml for the best cluster configuration parameter
> hama.zookeeper.quorum? 1 node 2 nodes, 3 nodes.

Once again, this depends on your availability requirements and on how the cluster is used.

> 3) When I process my graph with just over 65 000 vertices got the following
> error:
> attempt_201212260904_0005_000031_0: Exception in thread "pool-2-thread-1"
> java.lang.OutOfMemoryError: GC overhead limit exceeded
> attempt_201212260904_0005_000031_0: Exception in thread "Thread-1"
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Is there any parameter I change more increase the memory limit? Or my
> cluster will not be able to process this amount of information? With
> smaller graphs it works correctly. I'm working with the all-pairs problem.

As other users have reported recently, Hama is facing scalability issues. I am trying to close https://issues.apache.org/jira/browse/HAMA-559 and some other issues with the message object lifecycle. (Today we create a new Writable object for every message read and received.) Also, we keep all the vertices in memory.

In the meantime, you can change the JVM arguments of the task processes. Please look at what you can do with the configuration parameter bsp.child.java.opts. The default value can be found in hama-default.xml.

Regards,
Suraj
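
P.S. For reference, a minimal sketch of what such an override could look like in hama-site.xml. The property name bsp.child.java.opts is the one mentioned above. The 2048 MB heap is only an illustrative assumption, not a recommended value; it has to fit into the physical memory of each node, given the number of tasks that run there at the same time.

```xml
<?xml version="1.0"?>
<!-- hama-site.xml: site-specific overrides of hama-default.xml -->
<configuration>
  <property>
    <!-- JVM options passed to each BSP child task -->
    <name>bsp.child.java.opts</name>
    <!-- Assumed example value: 2 GB maximum heap; adjust to your nodes -->
    <value>-Xmx2048m</value>
  </property>
</configuration>
```

A larger heap only postpones the error if the graph keeps growing, because all vertices are still held in memory.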
