Hi, the Flume approach is the most reliable workflow here, since Flume has a built-in Syslog source as well as load balancing. On top of that, you can define multiple channels for different sources.
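As a rough illustration, a minimal Flume 1.x agent wiring a syslog source through a channel to an HDFS sink could look like the sketch below (agent and component names such as "agent1", ports, and paths are placeholders, not a production configuration):

```properties
# Hypothetical single-agent sketch: syslog TCP source -> memory channel -> HDFS sink
agent1.sources = syslog-src
agent1.channels = mem-ch
agent1.sinks = hdfs-sink

# Built-in syslog source listening on TCP
agent1.sources.syslog-src.type = syslogtcp
agent1.sources.syslog-src.host = 0.0.0.0
agent1.sources.syslog-src.port = 5140
agent1.sources.syslog-src.channels = mem-ch

# In-memory channel (swap for a file channel if you need durability)
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink; path is date-bucketed via escape sequences
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-ch
agent1.sinks.hdfs-sink.hdfs.path = /flume/syslog/%Y-%m-%d
```

To serve multiple sources, you would declare additional channels and bind each source to its own channel; see the Flume User Guide for the exact property names and options.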
Best, Alex

sent via my mobile device
mapredit.blogspot.com
@mapredit

> On Aug 7, 2013, at 1:44 PM, 武泽胜 <[email protected]> wrote:
>
> We have the same scenario as you described. The following is our solution, just FYI:
>
> We installed a local scribe agent on every node of our cluster, and we have several central scribe servers. We extended log4j to support writing logs to the local scribe agent, and the local scribe agents forward the logs to the central scribe servers; finally, the central scribe servers write these logs to a dedicated HDFS cluster used for offline processing.
>
> Then we use Hive/Impala to analyse the collected logs.
>
> From: Public Network Services <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, August 6, 2013 1:58 AM
> To: "[email protected]" <[email protected]>
> Subject: Large-scale collection of logs from multiple Hadoop nodes
>
> Hi...
>
> I am facing a large-scale log-collection scenario on a Hadoop cluster and am examining how it should be implemented.
>
> More specifically, imagine a cluster with hundreds of nodes, each of which constantly produces syslog events that need to be gathered and analyzed at another point. The total volume of logs could be tens of gigabytes per day, if not more, and the reception rate on the order of thousands of events per second, if not more.
>
> One solution is to send those events over the network (e.g., using Flume) and collect them on one or more (fewer than 5) nodes in the cluster, or at another location, where the logs will be processed either by a constantly running MapReduce job or by non-Hadoop servers running some log-processing application.
>
> Another approach could be to deposit all these events into a queuing system like ActiveMQ or RabbitMQ, or whatever.
>
> In all cases, the main objective is to be able to do real-time log analysis.
>
> What would be the best way of implementing the above scenario?
>
> Thanks!
>
> PNS
