oh, and I forgot to mention Kafka Streams, which has been heavily talked about the last few days at Strata here in San Jose.

Streams can simplify a lot of this architecture by performing some light-to-medium-complexity transformations in Kafka itself. I'm waiting anxiously for Kafka 0.10 with production-ready Kafka Streams so I can try this out myself - and hopefully remove a lot of extra plumbing.
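to make this concrete, here's a rough sketch of the kind of routing topology I have in mind - a Kafka Streams job that splits a raw log topic into high- and normal-priority topics, much like the Fronting Kafka + Samza Router described below. everything here (topic names, app id, broker address) is made up, and I'm writing against the 0.10-era Streams API, so treat it as a sketch rather than tested code:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class LogRouter {

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-router");        // made-up app id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // adjust to your cluster
    props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

    KStreamBuilder builder = new KStreamBuilder();

    // "raw-logs", "logs-high-priority", "logs-normal-priority" are made-up topic names
    KStream<String, String> logs = builder.stream("raw-logs");

    // crude priority split: error lines to one topic, everything else to another
    logs.filter((key, line) -> line != null && line.contains("ERROR"))
        .to("logs-high-priority");
    logs.filter((key, line) -> line == null || !line.contains("ERROR"))
        .to("logs-normal-priority");

    new KafkaStreams(builder, props).start();
  }
}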
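and for the Spark Streaming side of the pipeline discussed further down in the thread, the direct Kafka stream (Spark 1.6-era API) keeps the consumption end pretty thin. again, the app name, topic, and broker list are made-up placeholders:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class LogConsumer {

  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("log-analytics"); // made-up app name
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "localhost:9092"); // adjust to your brokers

    // subscribe only to the made-up high-priority topic from the router sketch
    Set<String> topics = Collections.singleton("logs-high-priority");

    JavaPairInputDStream<String, String> stream = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);

    // stand-in for the "complex use cases": just print the raw log lines
    stream.map(record -> record._2()).print();

    jssc.start();
    jssc.awaitTermination();
  }
}

the nice part of this split is that the routing logic lives in Kafka itself, so the Spark job only ever subscribes to the topics it actually cares about.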
On Thu, Mar 31, 2016 at 4:42 AM, Chris Fregly <ch...@fregly.com> wrote:

> this is a very common pattern, yes.
>
> note that in Netflix's case, they're currently pushing all of their logs
> to a Fronting Kafka + Samza Router which can route to S3 (or HDFS),
> Elasticsearch, and/or another Kafka Topic for further consumption by
> internal apps using other technologies like Spark Streaming (instead of
> Samza).
>
> this Fronting Kafka + Samza Router also helps to differentiate between
> high-priority events (Errors or High Latencies) and normal-priority events
> (normal User Play or Stop events).
>
> here's a recent presentation I did which details this configuration
> starting at slide 104:
> http://www.slideshare.net/cfregly/dc-spark-users-group-march-15-2016-spark-and-netflix-recommendations
>
> btw, Confluent's distribution of Kafka does have a direct HTTP/REST API;
> it is not recommended for production use, but it has worked well for me in
> the past.
>
> these are some additional options to think about, anyway.
>
> On Thu, Mar 31, 2016 at 4:26 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
>
>> On 31 Mar 2016, at 09:37, ashish rawat <dceash...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have been evaluating Spark for analysing Application and Server Logs. I
>> believe there are some downsides of doing this:
>>
>> 1. No direct mechanism of collecting logs, so I need to introduce other
>> tools like Flume into the pipeline.
>>
>> you need something to collect logs no matter what you run. Flume isn't so
>> bad; if you bring it up on the same host as the app then you can even
>> collect logs while the network is playing up.
>>
>> Or you can just copy log4j files to HDFS and process them later.
>>
>> 2. Need to write lots of code for parsing different patterns from logs,
>> while some log analysis tools like logstash or loggly provide this out of
>> the box.
>>
>> Log parsing is essentially an ETL problem, especially if you don't try to
>> lock down the log event format.
>>
>> You can also configure Log4J to save stuff in an easy-to-parse format
>> and/or forward directly to your application.
>>
>> There's a log4j-to-flume connector to do that for you:
>>
>> http://www.thecloudavenue.com/2013/11/using-log4jflume-to-log-application.html
>>
>> or you can output in, say, JSON (
>> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/log/Log4Json.java
>> )
>>
>> I'd go with flume unless you had a need to save the logs locally and copy
>> them to HDFS later.
>>
>> On the benefits side, I believe Spark might be more performant (although
>> I am yet to benchmark it) and, being a generic processing engine, might
>> work with complex use cases where the out-of-the-box functionality of log
>> analysis tools is not sufficient (although I don't have any such use case
>> right now).
>>
>> One option I was considering was to use logstash for collection and basic
>> processing and then sink the processed logs to both Elasticsearch and
>> Kafka, so that Spark Streaming can pick data from Kafka for the complex
>> use cases, while logstash filters can be used for the simpler use cases.
>> I was wondering if someone has already done this evaluation and could
>> provide me some pointers on how/if to create this pipeline with Spark.
>>
>> Regards,
>> Ashish
>
> --
> *Chris Fregly*
> Principal Data Solutions Engineer
> IBM Spark Technology Center, San Francisco, CA
> http://spark.tc | http://advancedspark.com

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com