Hi, the Flume approach is the most reliable workflow here, since Flume has a built-in Syslog source as well as load balancing. On top of that, you can define multiple channels for different sources.
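As a rough illustration, a minimal Flume 1.x agent wiring a syslog source through a channel to an HDFS sink could look like the sketch below (agent and component names such as "agent1", ports, and paths are placeholders, not a production configuration):

```properties
# Hypothetical single-agent sketch: syslog TCP source -> memory channel -> HDFS sink
agent1.sources = syslog-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Built-in syslog source listening on TCP
agent1.sources.syslog-src.type = syslogtcp
agent1.sources.syslog-src.host = 0.0.0.0
agent1.sources.syslog-src.port = 5140
agent1.sources.syslog-src.channels = mem-ch

# In-memory channel (swap for a file channel if you need durability)
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink; path is date-bucketed via escape sequences
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-ch
agent1.sinks.hdfs-sink.hdfs.path = /flume/syslog/%Y-%m-%d
```

To serve multiple sources, you would declare additional channels and bind each source to its own channel; see the Flume User Guide for the exact property names and options.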
Best, Alex

sent via my mobile device
mapredit.blogspot.com
@mapredit

> On Aug 7, 2013, at 1:44 PM, 武泽胜 <[email protected]> wrote:
>
> We have the same scenario as you described. The following is our solution, just FYI:
>
> We installed a local scribe agent on every node of our cluster, and we have several central scribe servers. We extended log4j to support writing logs to the local scribe agent, and the local scribe agents forward the logs to the central scribe servers; finally, the central scribe servers write these logs to a dedicated HDFS cluster used for offline processing.
>
> Then we use Hive/Impala to analyse the collected logs.
>
> From: Public Network Services <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, August 6, 2013 1:58 AM
> To: "[email protected]" <[email protected]>
> Subject: Large-scale collection of logs from multiple Hadoop nodes
>
> Hi...
>
> I am facing a large-scale log-collection scenario on a Hadoop cluster and am examining how it should be implemented.
>
> More specifically, imagine a cluster with hundreds of nodes, each of which constantly produces syslog events that need to be gathered and analyzed at another point. The total volume of logs could be tens of gigabytes per day, if not more, and the reception rate on the order of thousands of events per second, if not more.
>
> One solution is to send those events over the network (e.g., using Flume) and collect them on one or more (fewer than 5) nodes in the cluster, or at another location, where the logs will be processed either by a constantly running MapReduce job or by non-Hadoop servers running some log-processing application.
>
> Another approach could be to deposit all these events into a queuing system like ActiveMQ or RabbitMQ, or whatever.
>
> In all cases, the main objective is to be able to do real-time log analysis.
>
> What would be the best way of implementing the above scenario?
>
> Thanks!
>
> PNS
