I believe what Joe was referring to with RouteText was that it can take a regular expression with a capture group, and output a FlowFile per unique value of the capturing group. So if the incoming data is a FlowFile with a bunch of syslog messages and you provide a regex that captures the hostname, it can produce a FlowFile per unique hostname with all the messages that go with that hostname.
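To make the grouping idea concrete, here is a rough sketch in plain Python. It is not NiFi code and not how RouteText is implemented. The function name, the regex, and the sample messages are all made up for illustration, and the regex assumes RFC 3164-style lines where the hostname follows the timestamp.

```python
import re
from collections import defaultdict

# Hypothetical pattern for RFC 3164-style lines such as
# "<34>Oct 11 22:14:15 host-a su: message"; group 1 captures the hostname.
HOSTNAME_RE = re.compile(r"^<\d+>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} (\S+) ")


def group_by_capture(lines, pattern=HOSTNAME_RE):
    """Group lines by the value of the pattern's first capture group.

    Lines that do not match are collected under the key None.
    """
    groups = defaultdict(list)
    for line in lines:
        match = pattern.match(line)
        key = match.group(1) if match else None
        groups[key].append(line)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data standing in for the content of one incoming FlowFile.
    incoming = [
        "<34>Oct 11 22:14:15 host-a su: 'su root' failed",
        "<13>Oct 11 22:14:16 host-b sshd: session opened",
        "<34>Oct 11 22:14:17 host-a su: 'su root' failed again",
    ]
    for hostname, messages in group_by_capture(incoming).items():
        # Each group corresponds to what would become one outgoing FlowFile.
        print(hostname, len(messages))
```

The point is that the input is read once and each line is placed in a bucket keyed by the captured value, so there is no need to split into one FlowFile per line and merge again afterwards.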
I don't want to sidetrack the conversation about how to use MergeContent properly, but wanted to add a couple of things about how ListenSyslog works...

There is an attribute called "syslog.sender" which is the host that the message was received from. The value is populated from the incoming connection in Java code, not from anything in the syslog message. This should essentially be the host of the syslog server/forwarder.

There is an attribute called "syslog.hostname" which is the hostname in the syslog message itself, which should be the host that produced that message and sent it to a syslog server.

By default ListenSyslog has parse set to true and batch size set to 1. If you set parse to false and increase the batch size to, say, 100, it will try to grab a maximum of 100 messages in each execution of the processor (could be fewer depending on timing and what is available), and for those 100 messages it groups them by the "sender" (described above) and outputs a FlowFile per sender.

Batching can definitely get much higher throughput on ListenSyslog, but if you have to parse the messages later in the flow with ParseSyslog then you still need to get each message into its own FlowFile, which most likely entails SplitText with a line count of 1 and then ParseSyslog. I don't know if this turns out much better than just letting ListenSyslog parse them in the first place.

If you are letting ListenSyslog do the parsing then you can increase the concurrent tasks on the processor, which means more threads parsing syslog messages and outputting FlowFiles.

I think the batching concept makes the most sense when you don't need to parse the messages and just want to deliver the raw messages somewhere like HDFS or Kafka.

-Bryan

On Sun, Feb 7, 2016 at 10:03 AM, Andre <andre-li...@fucs.org> wrote:

> You can use RouteText to group (rather than split) on some shared
> pattern such as the hostname. Will be far more efficient than splitting
> each line then grouping on that hostname.
>
> Not sure I understand?