Thank you all for the feedback. I went ahead with the following solution
for my use case:

Lots of big files whose first line may or may not contain the word ERROR.
If it does contain ERROR, I need to remove that line.

1) RouteOnContent: added a property named ERROR with "ERROR" as its value
and set the content buffer size to "100 KB".
2) The ERROR relationship goes to ExecuteStreamCommand, which runs
tail -n +2 to remove the first line and keep the rest of the flowFile. The
output stream relationship then continues through the rest of the dataflow.
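
For anyone who wants to see the logic of those two steps outside NiFi, here
is a minimal Python sketch. It is not NiFi code: the function name
strip_error_header and its parameters are hypothetical, and it assumes (as in
my case) that ERROR can only appear in the first line. It peeks at no more
than peek_size bytes of the first line, drops that line only when the marker
is found, and streams the rest through unchanged, like tail -n +2.

```python
import io
import shutil


def strip_error_header(src, dst, marker=b"ERROR", peek_size=100 * 1024):
    """Copy src to dst, dropping the first line only if it contains marker.

    src and dst are binary file-like objects. Returns True if a line was
    dropped. Hypothetical helper, for illustration only.
    """
    # Read at most peek_size bytes of the first line (stops at a newline).
    first = src.readline(peek_size)
    dropped = marker in first
    if dropped:
        # If the first line was longer than peek_size, skip its remainder.
        while first and not first.endswith(b"\n"):
            first = src.readline(peek_size)
    else:
        # No marker: keep what we already read.
        dst.write(first)
    # Stream the remaining content in chunks without loading it all.
    shutil.copyfileobj(src, dst)
    return dropped


if __name__ == "__main__":
    source = io.BytesIO(b"ERROR bad header\nrow1\nrow2\n")
    target = io.BytesIO()
    print(strip_error_header(source, target), target.getvalue())
```

Like the ExecuteStreamCommand step, this still copies the remaining bytes
once; it only avoids evaluating every line of files that have no bad header.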

On Mon, Mar 6, 2017 at 8:27 PM Mark Payne <[email protected]> wrote:

> Joe,
>
> In terms of updating a processor to better handle this, we could update
> RouteText to do so. The idea being that if a large percentage of the time
> (say > 80% of the time) the FlowFile routed to one of the relationships is
> a single, contiguous subset of the data in the original FlowFile, then we
> could perform a 2-pass algorithm. In the first pass, we check to see if
> this is the case. If so, we can use session.clone(flowFile, offset, length)
> and we are done, without writing any content. If this is not the case, then
> we would have to perform a second pass over the data and write out the
> results as we do now.
>
> It is certainly not a trivial update to the processor but it is something
> that can be hidden from the user and in cases like this can provide
> significantly better performance. Something to think about.
>
> -Mark
>
> > On Mar 6, 2017, at 5:25 PM, Joe Witt <[email protected]> wrote:
> >
> > Juan,
> >
> > I think RouteText is the right answer.  It would indeed need to check
> > all lines to determine whether the condition is satisfied and to
> > remove the first line it will need to write out all the remaining
> > lines.  If a majority of the input files do not have this problematic
> > header I'd use RouteOnContent with a small buffer (however many bytes
> > would be in the header line of an erroneous file) and check for the
> > presence of "ERROR".  If it did hit then I'd route to RouteText to do
> > this more expensive piece.  If it didn't hit then you can move it on
> > without paying the RouteText cost.
> >
> > It is worth considering whether we should add a processor, or update an
> > existing one, to handle removal of lines from the beginning or end based
> > on some condition.  Not sure what that would look like as the requirements
> > can get pretty specific.  I do think generally for such cases that
> > ExecuteScript processors offer an excellent tradeoff such that one can
> > build a very small focused and fast script to do precisely what they
> > need.
> >
> > Thanks
> > Joe
> >
> >> On Mon, Mar 6, 2017 at 4:50 PM, Lee Laim <[email protected]> wrote:
> >> Juan,
> >>
> >> If you're in a Linux environment, you can use the ExecuteStreamCommand
> >> (ESC) processor to run "head -n 1" on the contents of the incoming large
> >> flowfile to quickly extract the first line.  ESC has an option to put the
> >> output of the command directly into a new attribute, and pass the
> >> "original contents" to the next processor.  The value of the new attribute
> >> contains the first line, while the entire file remains in the flowfile
> >> contents.  You can use the new attribute for a quick(er) routing decision.
> >>
> >> Thanks,
> >> Lee
> >>
> >>
> >>
> >>> On Mon, Mar 6, 2017 at 1:46 PM, Juan Sequeiros <[email protected]> wrote:
> >>>
> >>> Good afternoon all,
> >>>
> >>> I am trying to remove the first line of a file if it contains a certain
> >>> word, "ERROR".
> >>> I know it will exist only in the first line (I cannot fix the reason why
> >>> it gets put there).
> >>>
> >>> These files are big and there are lots of them.
> >>>
> >>> I cannot find a "fast" fix to pop the first line of a file;
> >>> everything I can think of within NiFi ends up at least running through
> >>> the whole file.
> >>>
> >>> I am using RouteText, which was suggested at one time on a separate thread.
> >>>
> >>> Routing Strategy: Route to "matched" if the line matches any condition.
> >>> Matching Strategy: Satisfies Expression
> >>> My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}
> >>>
> >>> I then route "matched" to auto-terminate and unmatched as my "new" file
> >>> without the first line.
> >>>
> >>> This seems to be working but it is slow, since I believe it still at
> >>> least runs through the whole file line by line.
> >>>
> >>> Are there any other suggestions?  I've read the "ExecuteGroovy" solutions
> >>> but they seem excessive if all I want is to remove the first line of a
> >>> file.
> >>>
> >>> I've also looked at ReplaceText and thought that would give me a clean
> >>> solution, since I thought I could control the input stream with the
> >>> "Maximum Buffer Size". But that is a conditional setting: I later learned
> >>> that if "Evaluation Mode" is Line-by-Line, "Maximum Buffer Size" only
> >>> applies to the buffer size of a single line.
> >>>
> >>> Thanks
> >>
> >>
>
