Thank you all for the feedback. I went ahead and ran this solution for my use case:

I have lots of big files whose first line may or may not contain the word ERROR. If it does, I need to remove that line.

1) RouteOnContent: added a property ERROR with "ERROR" as the value and set the content buffer size to "100 KB".
2) ERROR relationship goes to ExecuteStreamCommand, which runs tail -n +2 to remove the first line and keep the rest of the flowFile. The output stream relationship then continues through the rest of the dataflow.

On Mon, Mar 6, 2017 at 8:27 PM Mark Payne <[email protected]> wrote:

> Joe,
>
> In terms of updating a processor to better handle this, we could update
> RouteText to do so. The idea being that if a large percentage of the time
> (say > 80% of the time) the FlowFile routed to one of the relationships is
> a single, contiguous subset of the data in the original FlowFile, then we
> could perform a 2-pass algorithm. In the first pass, we check to see if
> this is the case. If so, we can use session.clone(flowFile, offset, length)
> and we are done, without writing any content. If this is not the case, then
> we would have to perform a second pass over the data and write out the
> results as we do now.
>
> It is certainly not a trivial update to the processor but it is something
> that can be hidden from the user and in cases like this can provide
> significantly better performance. Something to think about.
>
> -Mark
>
> Sent from my iPhone
>
> > On Mar 6, 2017, at 5:25 PM, Joe Witt <[email protected]> wrote:
> >
> > Juan,
> >
> > I think RouteText is the right answer. It would indeed need to check
> > all lines to determine whether the condition is satisfied and to
> > remove the first line it will need to write out all the remaining
> > lines. If a majority of the input files do not have this problematic
> > header I'd use RouteOnContent with a small buffer (however many bytes
> > would be in the header line of an erroneous file) and check for the
> > presence of "ERROR".
> > If it did hit then I'd route to RouteText to do
> > this more expensive piece. If it didn't hit then you can move it on
> > without paying the RouteText cost.
> >
> > It is useful to consider we'd have a processor, or update an existing
> > one to handle removal of lines from the beginning or end based on some
> > conditional. Not sure what that would look like as the requirements
> > can get pretty specific. I do think generally for such cases that
> > ExecuteScript processors offer an excellent tradeoff such that one can
> > build a very small focused and fast script to do precisely what they
> > need.
> >
> > Thanks
> > Joe
> >
> >> On Mon, Mar 6, 2017 at 4:50 PM, Lee Laim <[email protected]> wrote:
> >> Juan,
> >>
> >> If you're in the linux environment, you can use the Execute Stream command
> >> (ESC) processor to run "head -n 1" the contents of the incoming large
> >> flowfile to quickly extract the first line. ESC has an option to put the
> >> output of the command directly into a new attribute, and pass the "original
> >> contents" to the next processor. The value of the new attribute contains
> >> the first line, while the entire file remains in the flowfile contents. You
> >> can use the new attribute for quick(er) routing decision.
> >>
> >> Thanks,
> >> Lee
> >>
> >>
> >>
> >>> On Mon, Mar 6, 2017 at 1:46 PM, Juan Sequeiros <[email protected]> wrote:
> >>>
> >>> Good afternoon all,,
> >>>
> >>> I am trying to remove the first line of a file if it has a certain word in
> >>> it "ERROR"
> >>> I know it will exist only in the first line ( I can not fix the reason why
> >>> it gets put there )
> >>>
> >>> These files are big and lots of them.
> >>>
> >>> and I can not find a "fast" fix to pop the first line of a file,
> >>> everything I can think of within NIFI ends up at least running through the
> >>> whole file.
> >>>
> >>> I am using RouteText suggested at one time on separate thread.
> >>>
> >>> Routing Strategy: Route to "matched" if the line matches any condition.
> >>> Matching Strategy: Satisfies Expression
> >>> My expression: ${lineNo:lt(2):and(${line:find('ERROR')})}
> >>>
> >>> I then route "matched" to auto-terminate and unmatched as my "new" file
> >>> without the first line.
> >>>
> >>> This seems to be working but it is slow since I believe it still at least
> >>> runs through the whole file line by line.
> >>>
> >>> Is there any other suggestions? I've read the "ExecuteGroovy" solutions
> >>> but they seem excessive if all I want is to remove first line of file.
> >>>
> >>> I've also looked at ReplaceText and thought that would give me a clean
> >>> solution since I thought I could control input stream with the "Maximum
> >>> Buffer Size" but that is a conditional setting and if "evaluation Mode" is
> >>> Line-by-Line then I later learned "Maximum Buffer" is only for the buffer
> >>> size of the line.
> >>>
> >>> Thanks
> >>
> >>
>
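
For anyone who wants to try the idea from this thread outside NiFi, here is a minimal Python sketch of the same check-then-strip logic: peek at a small buffer, and only when the first line contains ERROR skip past it and copy the rest. This is an illustration only, not what the NiFi processors do internally. The names body_offset, strip_error_header and PEEK_BYTES are hypothetical, and the sketch assumes the header line fits inside the peek buffer.

```python
import io
import shutil

# Hypothetical constant: mirrors the "100 KB" buffer size mentioned in this thread.
PEEK_BYTES = 100 * 1024


def body_offset(stream, marker=b"ERROR", peek_bytes=PEEK_BYTES):
    """Return the byte offset at which the content to keep begins.

    Only the first peek_bytes of the stream are read. If the first line
    contains the marker, the offset points just past that line's newline.
    Otherwise the offset is 0, meaning the whole stream is kept.
    """
    stream.seek(0)
    head = stream.read(peek_bytes)
    newline = head.find(b"\n")
    if newline == -1:
        # Assumption: a header line always ends inside the peek buffer.
        # With no newline found we leave the content untouched.
        return 0
    if marker not in head[:newline]:
        return 0
    return newline + 1


def strip_error_header(src, dst, marker=b"ERROR"):
    """Copy src to dst, dropping the first line only if it contains marker."""
    offset = body_offset(src, marker)
    src.seek(offset)
    shutil.copyfileobj(src, dst)  # streams in chunks rather than loading the file
    return offset


if __name__ == "__main__":
    source = io.BytesIO(b"ERROR bad header\nline1\nline2\n")
    target = io.BytesIO()
    strip_error_header(source, target)
    print(target.getvalue())
```

The offset returned by body_offset is the kind of value that Mark's session.clone(flowFile, offset, length) suggestion would rely on, with length being the total size minus the offset. How that would be wired into a processor is not shown here.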
