On 31 Mar 2016, at 09:37, ashish rawat <[email protected]> wrote:
> Hi, I have been evaluating Spark for analysing application and server logs. I believe there are some downsides of doing this:
>
> 1. No direct mechanism of collecting logs, so we need to introduce other tools like Flume into the pipeline.

You need something to collect logs no matter what you run. Flume isn't so bad; if you bring it up on the same host as the app, you can even collect logs while the network is playing up. Or you can simply copy log4j files to HDFS and process them later.

> 2. Need to write lots of code for parsing different patterns from the logs, while log analysis tools like Logstash or Loggly provide this out of the box.

Log parsing is essentially an ETL problem, especially if you don't try to lock down the log event format. You can also configure Log4J to save events in an easy-to-parse format and/or forward them directly to your application. There's a log4j-to-flume connector to do that for you (http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html), or you can output in, say, JSON (https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java). I'd go with Flume unless you need to save the logs locally and copy them to HDFS later.

> On the benefits side, I believe Spark might be more performant (although I am yet to benchmark it) and, being a generic processing engine, might work with complex use cases where the out-of-the-box functionality of log analysis tools is not sufficient (although I don't have any such use case right now).
>
> One option I was considering was to use Logstash for collection and basic processing, then sink the processed logs to both Elasticsearch and Kafka, so that Spark Streaming can pick up data from Kafka for the complex use cases while Logstash filters handle the simpler ones.
>
> I was wondering if someone has already done this evaluation and could give me some pointers on how/if to create this pipeline with Spark.
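On the parsing point under 2 above: here's a minimal sketch of the kind of extraction code this involves. The regex, field names, and sample line are my assumptions for a common access-log format, not anything your apps are guaranteed to emit; real application logs will need their own patterns, which is exactly the ETL work in question:

```python
import re

# Assumed pattern for a Common Log Format access line; app logs will differ.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d["status"] = int(d["status"])
    d["size"] = 0 if d["size"] == "-" else int(d["size"])
    return d

sample = '127.0.0.1 - - [31/Mar/2016:09:37:00 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(parse_line(sample))
```

In Spark this function would just be mapped over an RDD of lines, with the `None` results filtered out; the hard part is maintaining one such pattern per log format.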
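And on the JSON suggestion: the Log4Json class linked above does this on the Java side. Purely as an illustration of the idea, here is a rough Python analogue (the class and field names are my own) — a formatter that emits one JSON object per line, so the downstream consumer parses with `json.loads` instead of per-format regexes:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Illustrative formatter: render each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Log into an in-memory buffer just to show the round trip.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("user %s logged in", "ashish")

# Downstream, each line parses with no custom regex at all.
event = json.loads(buf.getvalue())
print(event["message"])
```

The same structured-output principle is what makes the Log4J-to-Flume or Log4Json routes attractive: you pay the formatting cost once at the source instead of writing parsers per pattern.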
> Regards, Ashish
