[jira] [Commented] (GIRAPH-77) Coordinator should expose a web interface with progress, vertex region assignments, etc.
[ https://issues.apache.org/jira/browse/GIRAPH-77?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151910#comment-13151910 ]

Arun Suresh commented on GIRAPH-77:
-----------------------------------

This looks closely related to [GIRAPH-76|https://issues.apache.org/jira/browse/GIRAPH-76], since I will be refactoring GraphMapper. I can take this up as well if you haven't already started.

Coordinator should expose a web interface with progress, vertex region assignments, etc.

Key: GIRAPH-77
URL: https://issues.apache.org/jira/browse/GIRAPH-77
Project: Giraph
Issue Type: New Feature
Reporter: Jakob Homan

It would be nice if the coordinator worker had a web interface that showed progress, splits, etc. during job execution. Right now it would duplicate information currently exposed through task status, but with the move to YARN it will become a necessity. It would be great to do this in a modern way that avoids the screen-scraping currently used to get information from most other Hadoop projects' web interfaces. The coordinator could announce its address at the beginning or via status updates.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-96) Support for Graphs with Huge adjacency lists
[ https://issues.apache.org/jira/browse/GIRAPH-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13152087#comment-13152087 ]

Arun Suresh commented on GIRAPH-96:
-----------------------------------

Looks like Claudio beat me to a similar suggestion in [GIRAPH-94|https://issues.apache.org/jira/browse/GIRAPH-94]. My proposal was more for a standard means of storing vertex/adjacency-list information. The Giraph framework would handle the storage and would expose APIs that the vertex reader can use to store the information as it reads the graph. The user would then not be required to subclass a Vertex class and implement the initialize() method; all adjacency-list/vertex manipulation would go through the common data store.

Support for Graphs with Huge adjacency lists

Key: GIRAPH-96
URL: https://issues.apache.org/jira/browse/GIRAPH-96
Project: Giraph
Issue Type: Improvement
Reporter: Arun Suresh

Currently the vertex initialize() method is passed the complete adjacency list as a HashMap. All the current concrete implementations of Vertex iterate over the adjacency list and recreate new data structures within the Vertex instance to hold and manipulate it. This would cease to be feasible once the size of the adjacency list becomes really huge. I propose storing the adjacency list and all vertex information (and incoming messages?) in a distributed data store such as HBase. The adjacency list can be lazily loaded via HBase Scans. I was thinking of an HBase schema where the row id is a concatenation of VertexId+OutboundVertexId, with a single column containing the edge.
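The composite row-key layout proposed above can be sketched in plain Java. This is only an illustration of the idea, not Giraph or HBase code: a TreeMap stands in for the HBase table, and the class and method names (EdgeStoreSketch, putEdge, adjacencyList) are hypothetical.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the proposed schema: row key = vertexId + separator + outboundVertexId,
// one column holding the edge value. A sorted map stands in for the HBase table,
// so the lazy per-vertex Scan can be simulated with a subMap() range query.
public class EdgeStoreSketch {
    private final TreeMap<String, Double> table = new TreeMap<>();

    // '\u0000' as the separator keeps all of a vertex's rows in one contiguous range.
    private static String rowKey(long vertexId, long outboundVertexId) {
        return vertexId + "\u0000" + outboundVertexId;
    }

    public void putEdge(long vertexId, long outboundVertexId, double edgeValue) {
        table.put(rowKey(vertexId, outboundVertexId), edgeValue);
    }

    // Lazily iterate a vertex's adjacency list: the analogue of an HBase Scan
    // with startRow = vertexId + '\u0000' and stopRow = vertexId + '\u0001'.
    public SortedMap<String, Double> adjacencyList(long vertexId) {
        return table.subMap(vertexId + "\u0000", vertexId + "\u0001");
    }
}
```

The point of the composite key is that a single range scan retrieves one vertex's edges without ever materializing the whole adjacency list in one object.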
[jira] [Commented] (GIRAPH-91) Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)
[ https://issues.apache.org/jira/browse/GIRAPH-91?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151823#comment-13151823 ]

Arun Suresh commented on GIRAPH-91:
-----------------------------------

Avery, I see that you have used two sorted ArrayLists. Couldn't a LinkedHashMap have been an alternative? I understand that getEdgeValue and hasEdgeValue would be faster with a sorted ArrayList, and ArrayLists are also more compact. But I was just wondering: in the event that the graph is truly large (millions of edges for a vertex), would it make sense to have the entire edge list in memory in the first place? We might need a scheme where only part of the list is in memory, with chunks of the list fetched on demand when the provided iterator calls next(). In that case we could have a hybrid array + linked list (a linked list of chunks of the edge list).

Large-memory improvements (Memory reduced vertex implementation, fast failure, added settings)

Key: GIRAPH-91
URL: https://issues.apache.org/jira/browse/GIRAPH-91
Project: Giraph
Issue Type: Improvement
Reporter: Avery Ching
Assignee: Avery Ching
Attachments: GIRAPH-91.diff

The current vertex implementation uses a HashMap for storing the edges, which is quite memory-heavy for large graphs. The default settings in Giraph need to be improved for large graphs and heaps of 20G.
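The hybrid "linked list of array chunks" idea from the comment above can be sketched as follows. This is a minimal illustration, not part of the GIRAPH-91 patch: the names (ChunkedEdgeList, CHUNK_SIZE) are hypothetical, and the chunks live in memory here, whereas the comment envisions fetching them on demand at chunk boundaries.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// The edge list is a sequence of fixed-size array chunks. A real implementation
// could load each chunk lazily (from disk or a store) when the iterator crosses
// a chunk boundary; that boundary is marked in next() below.
public class ChunkedEdgeList implements Iterable<Long> {
    private static final int CHUNK_SIZE = 4; // tiny, for illustration only
    private final List<long[]> chunks = new ArrayList<>();
    private int sizeInLastChunk = CHUNK_SIZE; // forces allocation on first add()

    public void add(long targetVertexId) {
        if (sizeInLastChunk == CHUNK_SIZE) {
            chunks.add(new long[CHUNK_SIZE]); // start a new chunk
            sizeInLastChunk = 0;
        }
        chunks.get(chunks.size() - 1)[sizeInLastChunk++] = targetVertexId;
    }

    @Override
    public Iterator<Long> iterator() {
        return new Iterator<Long>() {
            private int chunkIndex = 0;
            private int offset = 0;

            @Override
            public boolean hasNext() {
                if (chunkIndex >= chunks.size()) {
                    return false;
                }
                int limit = (chunkIndex == chunks.size() - 1) ? sizeInLastChunk : CHUNK_SIZE;
                return offset < limit || chunkIndex < chunks.size() - 1;
            }

            @Override
            public Long next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                if (offset == CHUNK_SIZE) {
                    // Chunk boundary: this is where a lazy implementation
                    // would fetch the next chunk on demand.
                    chunkIndex++;
                    offset = 0;
                }
                return chunks.get(chunkIndex)[offset++];
            }
        };
    }
}
```

Compared with one giant array, appends never trigger a full copy, and only the chunks actually being iterated need to be resident.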
[jira] [Commented] (GIRAPH-76) Refactor worker logic from GraphMapper
[ https://issues.apache.org/jira/browse/GIRAPH-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151825#comment-13151825 ]

Arun Suresh commented on GIRAPH-76:
-----------------------------------

Yes, this does sound like a good idea. I could take a crack at it if you haven't already started.

Refactor worker logic from GraphMapper

Key: GIRAPH-76
URL: https://issues.apache.org/jira/browse/GIRAPH-76
Project: Giraph
Issue Type: Improvement
Components: graph
Reporter: Jakob Homan

The plumbing around executing vertices is hosted within the mapper, but could be extracted to its own class and executed from the Mapper directly. This would ease testing and make it easier to host in the new YARN infrastructure. There's nothing mapper-specific about this code.
[jira] [Commented] (GIRAPH-93) Hive input / output format
[ https://issues.apache.org/jira/browse/GIRAPH-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13151851#comment-13151851 ]

Arun Suresh commented on GIRAPH-93:
-----------------------------------

Avery, this might not be an optimal solution, but just putting it out there. I understand Hive exposes a JDBC interface. One can use the JDBC interface and DBInputFormat (http://www.cloudera.com/blog/2009/03/database-access-with-hadoop/) to load data from a Hive table for a MapReduce job.

Hive input / output format

Key: GIRAPH-93
URL: https://issues.apache.org/jira/browse/GIRAPH-93
Project: Giraph
Issue Type: New Feature
Reporter: Avery Ching
Assignee: Avery Ching

It would be great to be able to load/store data from/to Hive tables.
[jira] [Commented] (GIRAPH-61) Worker's early failure will cause the whole system fail
[ https://issues.apache.org/jira/browse/GIRAPH-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150646#comment-13150646 ]

Arun Suresh commented on GIRAPH-61:
-----------------------------------

Since a Giraph job is a SINGLE Hadoop job, and since the MasterThread has no option but to fail the current job if all workers don't respond (it cannot spawn new workers or restart existing ones), there might not be a way around this in the current scheme. But consider what would happen if the Giraph job were composed of two Hadoop jobs. Basically, the first job would start just the Master map task, which takes care of job initialization, starting ZooKeeper, etc. Finally, it would kick off a second Hadoop job. The second Hadoop job would be similar to the current Giraph job and would spawn only worker tasks. The Master task from the first job stays alive until the algorithm completes. Pig apparently does something similar: it starts a single map task first, which in turn spawns multiple subtasks.

Pros:
* The whole Giraph job need not fail if a worker fails at startup as above. Under the new scheme, the Master will detect at the start of the superstep that not all workers have responded, and can opt to completely restart JUST the second job.
* The GiraphMapper class can be split into a MasterMapper and a WorkerMapper, which might make things a bit cleaner. (There are a couple of places where an if (mapFunctions == MapFunctions.MASTER_ONLY) check is required to differentiate a Master task from a Worker task. This can cleanly be refactored into two classes, each with a specific responsibility.)
* At some point we will possibly move to YARN, where Giraph would be userland code decoupled from MapReduce itself. In that case the Master would map to the ApplicationMaster, and the refactoring would allow Giraph to be more easily retrofitted to YARN.
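The MasterMapper/WorkerMapper split suggested above can be sketched in miniature. This is a purely illustrative refactoring pattern, with hypothetical names (GiraphTask, MasterTask, WorkerTask, TaskFactory); it is not the actual GiraphMapper code.

```java
// Instead of scattering `if (mapFunctions == MapFunctions.MASTER_ONLY)` checks
// through one class, the role decision is made once, and each role gets its own
// class with a single responsibility.
interface GiraphTask {
    String run();
}

class MasterTask implements GiraphTask {
    @Override
    public String run() {
        // coordination only: superstep barriers, worker health checks, etc.
        return "master: coordinating supersteps";
    }
}

class WorkerTask implements GiraphTask {
    @Override
    public String run() {
        // computation only: load vertices, compute, exchange messages
        return "worker: computing vertices";
    }
}

class TaskFactory {
    // Single dispatch point replacing the per-call-site role checks.
    static GiraphTask create(boolean isMaster) {
        return isMaster ? new MasterTask() : new WorkerTask();
    }
}
```

With the role chosen at construction time, the rest of each class never needs to ask which kind of task it is, which is also what would let the Master side map cleanly onto a YARN ApplicationMaster.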
Worker's early failure will cause the whole system fail

Key: GIRAPH-61
URL: https://issues.apache.org/jira/browse/GIRAPH-61
Project: Giraph
Issue Type: Bug
Components: bsp
Affects Versions: 0.70.0
Reporter: Zhiwei Gu
Priority: Critical

When an early failure happens to a worker, the whole system will fail.
Observed failed worker state: Creating RPC threads failed.
Result: the worker fails; however, the master has already recorded and reserved these splits for this worker (identified by InetAddress). Thus, although Hadoop reschedules a mapper for this worker, the master is still waiting for the old worker's response, and finally the master fails.

[Failed worker logs:]
2011-10-24 18:19:51,051 INFO org.apache.giraph.graph.BspService: process: vertexRangeAssignmentsReadyChanged (vertex ranges are assigned)
2011-10-24 18:19:51,060 INFO org.apache.giraph.graph.BspServiceWorker: startSuperstep: Ready for computation on superstep 1 since worker selection and vertex range assignments are done in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/1/_vertexRangeAssignments
2011-10-24 18:19:51,078 INFO org.apache.giraph.graph.BspServiceWorker: getAggregatorValues: no aggregators in /_hadoopBsp/job_201108260911_842943/_applicationAttemptsDir/0/_superstepDir/0/_mergedAggregatorDir on superstep 1
2011-10-24 18:19:53,974 INFO org.apache.giraph.graph.GraphMapper: map: totalMem=84213760 maxMem=2067988480 freeMem=65069808
2011-10-24 18:19:53,974 INFO org.apache.giraph.comm.BasicRPCCommunications: flush: starting...
2011-10-24 18:19:54,022 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=102400 and reduceRetainSize=102400
2011-10-24 18:19:54,023 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:597)
	at java.lang.UNIXProcess$1.run(UNIXProcess.java:141)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:103)
	at java.lang.ProcessImpl.start(ProcessImpl.java:65)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:200)
	at org.apache.hadoop.util.Shell.run(Shell.java:182)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
	at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
	at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:540)
	at org.apache.hadoop.fs.RawLocalFileSystem.access$100(RawLocalFileSystem.java:37)
	at