I believe what Joe was referring to with RouteText was that it can take a
regular expression with a capture group, and output a FlowFile per unique
value of the capturing group. So if the incoming data is a FlowFile with a
bunch of syslog messages and you provide a regex that captures hostname, it
can produce a FlowFile per unique hostname with all the messages that go
with that hostname.
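
As a rough illustration of that grouping idea (a plain Python sketch, not
RouteText's actual implementation), the snippet below buckets lines by the
value of the first capture group. The regex and the names HOSTNAME_RE and
group_by_capture are assumptions made up for this example; the pattern
assumes RFC 3164-style lines such as "<34>Oct 11 22:14:15 web1 ...".

```python
import re

# Hypothetical pattern: "<PRI>Mmm dd hh:mm:ss HOSTNAME rest-of-message".
# The single capture group is the hostname.
HOSTNAME_RE = re.compile(r"^<\d+>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} (\S+) ")


def group_by_capture(lines, pattern=HOSTNAME_RE):
    """Return a dict mapping each unique captured value to its lines.

    Lines that do not match the pattern are collected under the key None.
    """
    groups = {}
    for line in lines:
        match = pattern.match(line)
        key = match.group(1) if match else None
        groups.setdefault(key, []).append(line)
    return groups


if __name__ == "__main__":
    sample = [
        "<34>Oct 11 22:14:15 web1 su: auth failure",
        "<34>Oct 11 22:14:16 db1 sshd: session opened",
        "<13>Feb  7 10:03:00 web1 cron: job started",
    ]
    for host, msgs in group_by_capture(sample).items():
        print(host, len(msgs))
```

Each value in the returned dict plays the role of one outgoing FlowFile:
all lines for one hostname stay together, and no per-line split is needed.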

I don't want to sidetrack the conversation about how to use MergeContent
properly, but wanted to add a couple of things about how ListenSyslog
works...

There is an attribute called "syslog.sender" which is the host that the
message was received from. The value is populated from the incoming
connection in the Java code, not from anything in the syslog message. This
should essentially be the host of the syslog server/forwarder.

There is an attribute called "syslog.hostname" which is the hostname in the
syslog message itself, which should be the host that produced that message
and sent it to a syslog server.

By default ListenSyslog has parse set to true and batch size set to 1. If
you set parse to false and increase the batch size to, say, 100, it will try
to grab a maximum of 100 messages in each execution of the processor (could
be fewer depending on timing and what is available). It then groups the
messages in that batch by the "sender" (described above) and outputs one
FlowFile per sender.
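
The batching behavior described above can be sketched roughly like this.
It is a simplified Python model, not the processor's Java code: the
function name drain_batch, the in-memory deque, and joining messages with
newlines are all assumptions made for illustration.

```python
from collections import deque


def drain_batch(queue, batch_size=100):
    """Take at most batch_size (sender, message) pairs and group by sender.

    Returns a dict mapping sender to a newline-joined blob of its messages,
    which stands in for one output FlowFile per sender.
    """
    per_sender = {}
    # Fewer than batch_size messages may be available on a given run.
    for _ in range(min(batch_size, len(queue))):
        sender, message = queue.popleft()
        per_sender.setdefault(sender, []).append(message)
    return {sender: "\n".join(msgs) for sender, msgs in per_sender.items()}


if __name__ == "__main__":
    pending = deque([
        ("10.0.0.1", "<34>Oct 11 22:14:15 web1 su: auth failure"),
        ("10.0.0.2", "<34>Oct 11 22:14:16 db1 sshd: session opened"),
        ("10.0.0.1", "<13>Oct 11 22:14:17 web2 cron: job started"),
    ])
    for sender, blob in drain_batch(pending, batch_size=100).items():
        print(sender, blob.count("\n") + 1)
```

Note that the grouping key here is the sender (the connection's remote
host), so one output can still contain messages from several different
"syslog.hostname" values.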

Batching can definitely get much higher throughput on ListenSyslog, but if
you have to parse the messages later in the flow with ParseSyslog then you
still need to get each message into its own FlowFile, which most likely
entails SplitText with a line count of 1 and then ParseSyslog. I don't know
if this turns out much better than just letting ListenSyslog parse them in
the first place. If you are letting ListenSyslog do the parsing then you can
increase the concurrent tasks on the processor, which means more threads
parsing syslog messages and outputting FlowFiles.

I think the batching concept makes the most sense when you don't need to
parse the messages and just want to deliver the raw messages somewhere like
HDFS, or Kafka.

-Bryan


On Sun, Feb 7, 2016 at 10:03 AM, Andre <andre-li...@fucs.org> wrote:

>
> > You can use RouteText to group (rather than split) on some shared
> pattern such as the hostname.  Will be far more efficient than splitting
> each line then grouping on that hostname.
>
> Not sure I understand?
>
>
>
