Re: Adding a Row Number to Records

James McMahon Fri, 03 Jan 2020 10:32:19 -0800

It would be very useful indeed. Thanks very much for your comments.

On Fri, Jan 3, 2020 at 1:29 PM Joe Witt <joe.w...@gmail.com> wrote:


> because splitting them just to achieve this means creating potentially
> thousands of flowfiles rather than playing the data as it lies and in its
> most efficient form.
>
> The idea to enable certain automatically managed things which one could
> inject into their records like record number (relative to that bundle of
> records) is a cool approach/idea that won't add overhead.
>
> thanks
>
> On Fri, Jan 3, 2020 at 1:27 PM James McMahon <jsmcmah...@gmail.com> wrote:
>
>>
>> I was wondering why SplitContent or SegmentContent are bad ways to
>> approach this requirement? Each gives us a fragment.index attribute that
>> potentially could be prepended to any content. Would such an approach be
>> impractical for very large flow files, and so perhaps impose excessive
>> demands on the JVM? Is it because there is no reliable indicator of
>> where each record ends in the flow file?
>> I was thinking that if the flow files were of manageable size in terms of
>> record count, why not break them apart, add the fragment.index attribute
>> where you want in each record, and then re-merge them.
>> Probably something you considered and found to be inadequate. I was
>> interested in learning more.
>>
>> On Fri, Jan 3, 2020 at 10:24 AM Shawn Weeks <swe...@weeksconsulting.us> wrote:
>>
>>> Adding an additional attribute to UpdateRecord sounds pretty
>>> straightforward. The only thing I'm not sure about is where to store the
>>> state between calls to UpdateRecord.process. It would also be nicer if
>>> UpdateRecord could update the schema, but I think there is already a Jira
>>> for that.
>>>
>>> Thanks
>>> Shawn
>>>
>>> On 1/2/20, 4:14 PM, "Matt Burgess" <mattyb...@apache.org> wrote:
>>>
>>>     Shawn,
>>>
>>>     This seems like something we could do in UpdateRecord by supplying a
>>>     synthetic attribute called "record.index" or "record.number" or
>>>     something, so you can use Expression Language for updating the field.
>>>     It may also be possible to do something with Calcite and QueryRecord
>>>     but that might be overkill depending on how/if it would be
>>>     implemented (e.g., push-down predicate, synthetic columns).
>>>
>>>     Please feel free to write an improvement Jira for the UpdateRecord
>>>     processor if that satisfies your use case.
>>>
>>>     Regards,
>>>     Matt
>>>
>>>     On Thu, Jan 2, 2020 at 11:44 AM Shawn Weeks <swe...@weeksconsulting.us> wrote:
>>>     >
>>>     > I have a use case where I need to append a row number to every
>>>     > record in a flow file. Not everything I receive is text so the only
>>>     > guarantee I have is that I have a record reader for each type of file. I
>>>     > started out looking at the row_number window function in QueryRecord but
>>>     > after looking at things for the past few days that isn’t going to work
>>>     > since Calcite requires everything to be in memory to execute the window
>>>     > function.
>>>     >
>>>     >
>>>     >
>>>     > So I’m looking for another approach that can take advantage of the
>>>     > record API. It seems like it would be possible to write a counter function
>>>     > for UpdateRecord but I don’t know enough about the API to know if there is
>>>     > a context maintained throughout a given flow file to stash the variable in.
>>>     >
>>>     > Thanks
>>>     >
>>>     > Shawn Weeks
>>>
>>>
>>>
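
The per-record counter discussed in this thread can be illustrated with a
minimal sketch. It is written in plain Python rather than against the NiFi
record API, and it is not how UpdateRecord is implemented. The function name
with_row_numbers and the field name row_number are hypothetical, chosen only
for illustration. The point it shows is that numbering records while
streaming over them needs only a single integer of state per flow file, so
nothing has to be split into separate flow files or held fully in memory.

```python
from typing import Any, Dict, Iterable, Iterator


def with_row_numbers(
    records: Iterable[Dict[str, Any]],
    field: str = "row_number",  # hypothetical field name
    start: int = 1,
) -> Iterator[Dict[str, Any]]:
    """Yield a copy of each record with a running index added.

    Records are consumed one at a time, so the only state kept between
    records is the counter maintained by enumerate().
    """
    for index, record in enumerate(records, start=start):
        numbered = dict(record)  # shallow copy; the input record is left untouched
        numbered[field] = index
        yield numbered
```

The counter is relative to the iterable passed in, which corresponds to
numbering records relative to the bundle of records in one flow file, as
suggested above.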
