The existing "tail" source is not the best choice in your scenario, as you have pointed out.
SpoolDir could be a solution if your log file rotation interval is very short (5 minutes, for example), but then you have to deal with a huge number of files in the folder (slower listings). There is a proposal for a new approach, something that combines the best of "tail" and "spoolDir". Take a look here: https://issues.apache.org/jira/browse/FLUME-2498

2015-01-29 0:24 GMT+01:00 Lakshmanan Muthuraman <lakshma...@tokbox.com>:

> We have been using Flume to solve a very similar use case. Our servers
> write the log files to a local file system, and then we have a Flume
> agent which ships the data to Kafka.
>
> With Flume you can use the exec source running tail. Though the exec
> source runs well with tail, there are issues if the agent goes down or
> the file channel starts backing up. If the agent goes down, you can ask
> the Flume exec tail source to go back n lines or to read from the
> beginning of the file. The challenge is that we roll our log files on a
> daily basis. What if the agent goes down in the evening? We would have
> to go back over the entire day's worth of data for reprocessing, which
> slows down the data flow. We can also go back an arbitrary number of
> lines, but then we don't know what the right number to go back is. This
> is the kind of challenge we face. We have tried the spooling directory
> source, which works, but it requires a different log file rotation
> policy. We even considered rotating files every minute, but that would
> still delay the real-time data flow in our
> kafka -> storm -> Elasticsearch pipeline by a minute.
>
> We are going to do a PoC on Logstash to see how it solves these
> problems with Flume.
>
> On Wed, Jan 28, 2015 at 10:39 AM, Fernando O. <fot...@gmail.com> wrote:
>
> > Hi all,
> > I'm evaluating using Kafka.
> >
> > I liked this feature of Facebook Scribe: you log to your own machine,
> > and then there's a separate process that forwards messages to the
> > central logger.
> >
> > With Kafka it seems that I have to embed the publisher in my app and
> > deal with any communication problems on the producer side.
> >
> > I googled quite a bit trying to find a project that basically uses a
> > daemon that parses a log file and sends the lines to the Kafka
> > cluster (something like a tail file.log, but instead of redirecting
> > the output to the console, it sends it to Kafka).
> >
> > Does anyone know about something like that?
> >
> > Thanks!
> > Fernando.

--
David Morales de Frías :: +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>

<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
<https://twitter.com/StratioBD>*
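The core of the "tail a log and ship it" daemon Fernando asks about could be sketched roughly as below. This is a minimal Python illustration, not a finished agent: the names `read_new_lines`, `ship`, and the `publish` callback are made up for this example, and a real deployment would wire `publish` to an actual Kafka producer and persist the offset to disk. Checkpointing the byte offset is what avoids the "how many lines do we go back" guessing Lakshmanan describes after an agent restart.

```python
# Minimal sketch of a checkpointed log shipper (names are illustrative).

def read_new_lines(path, offset=0):
    """Read complete lines appended to `path` since byte `offset`.

    Returns (lines, new_offset). Persisting new_offset between runs lets
    a restarted agent resume exactly where it left off, instead of
    replaying the whole day's file or an arbitrary number of lines."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Only consume up to the last newline: a partially written trailing
    # line stays unread until the writer finishes it.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", "replace").splitlines()
    return lines, offset + end

def ship(path, offset, publish):
    """One polling pass: publish every new complete line and return the
    advanced checkpoint offset for the next pass."""
    lines, new_offset = read_new_lines(path, offset)
    for line in lines:
        publish(line)  # e.g. a Kafka producer send, in a real daemon
    return new_offset
```

A real daemon would also handle rotation (detect that the inode changed and reset the offset to 0, like tail -F does); that is left out here for brevity.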