It would be very useful indeed. Thanks very much for your comments.

On Fri, Jan 3, 2020 at 1:29 PM Joe Witt <joe.w...@gmail.com> wrote:
> because splitting them just to achieve this means creating potentially
> thousands of flowfiles rather than playing the data as it lies and in its
> most efficient form.
>
> The idea to enable certain automatically managed things which one could
> inject into their records, like record number (relative to that bundle of
> records), is a cool approach/idea that won't add overhead.
>
> thanks
>
> On Fri, Jan 3, 2020 at 1:27 PM James McMahon <jsmcmah...@gmail.com> wrote:
>
>> I was wondering why SplitContent or SegmentContent are bad ideas to
>> approach this requirement? Each gives us a fragment.index attribute that
>> potentially could be prepended to any content. Would such an approach be
>> impractical for very large flow files, and so perhaps would impose
>> excessive demands on the JVM? Is it because there is no reliable indicator
>> where each record ends in the flow file?
>> I was thinking that if the flow files were of manageable size in terms of
>> record count, why not break them apart, add the fragment.index attribute
>> where you want in each record, and then re-merge them.
>> Probably something you considered and found to be inadequate. I was
>> interested in learning more.
>>
>> On Fri, Jan 3, 2020 at 10:24 AM Shawn Weeks <swe...@weeksconsulting.us>
>> wrote:
>>
>>> Adding an additional attribute to UpdateRecord sounds pretty
>>> straightforward; the only thing I'm not sure about is where to store
>>> the state between calls to UpdateRecord.process. It would also be nicer
>>> if UpdateRecord could update the schema, but I think there is already a
>>> Jira for that.
>>>
>>> Thanks
>>> Shawn
>>>
>>> On 1/2/20, 4:14 PM, "Matt Burgess" <mattyb...@apache.org> wrote:
>>>
>>> Shawn,
>>>
>>> This seems like something we could do in UpdateRecord by supplying a
>>> synthetic attribute called "record.index" or "record.number" or
>>> something, so you can use Expression Language for updating the field.
>>> It may also be possible to do something with Calcite and QueryRecord,
>>> but that might be overkill depending on how/if it would be implemented
>>> (e.g., push-down predicate, synthetic columns).
>>>
>>> Please feel free to write an improvement Jira for the UpdateRecord
>>> processor if that satisfies your use case.
>>>
>>> Regards,
>>> Matt
>>>
>>> On Thu, Jan 2, 2020 at 11:44 AM Shawn Weeks <swe...@weeksconsulting.us> wrote:
>>> >
>>> > I have a use case where I need to append a row number to every
>>> > record in a flow file. Not everything I receive is text, so the only
>>> > guarantee I have is that I have a record reader for each type of file. I
>>> > started out looking at the row_number window function in QueryRecord, but
>>> > after looking at things for the past few days that isn’t going to work,
>>> > since Calcite requires everything to be in memory to execute the window
>>> > function.
>>> >
>>> > So I’m looking for another approach that can take advantage of the
>>> > record API. It seems like it would be possible to write a counter function
>>> > for UpdateRecord, but I don’t know enough about the API to know if there is
>>> > a context maintained throughout a given flow file to stash the variable in.
>>> >
>>> > Thanks
>>> >
>>> > Shawn Weeks
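
For illustration only, here is a minimal sketch of the streaming idea discussed in this thread: number the records as they pass through, keeping a single counter as state, instead of splitting the flow file or loading everything into memory. This is plain Python, not the NiFi record API. The function name `with_record_index` and its parameters are hypothetical, and the field name "record.index" is simply borrowed from the suggestion quoted above.

```python
from typing import Any, Dict, Iterable, Iterator


def with_record_index(
    records: Iterable[Dict[str, Any]],
    field: str = "record.index",  # hypothetical field name, taken from the thread
    start: int = 0,
) -> Iterator[Dict[str, Any]]:
    """Yield a copy of each record with a running index added.

    Records are consumed one at a time, so the only state kept between
    records is the counter maintained by enumerate().
    """
    for index, record in enumerate(records, start=start):
        numbered = dict(record)  # shallow copy; the input record is left untouched
        numbered[field] = index
        yield numbered


if __name__ == "__main__":
    # The input can be any iterator, e.g. a reader that streams records lazily.
    rows = iter([{"name": "a"}, {"name": "b"}, {"name": "c"}])
    for row in with_record_index(rows, start=1):
        print(row)
```

Because the function is a generator, the index is relative to whatever bundle of records is passed in, which matches the "relative to that bundle of records" behaviour described above. How such a counter would be held inside UpdateRecord is still the open question in this thread.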