Inline. On Thu, Nov 1, 2018 at 1:40 PM Bryan Bende <bbe...@gmail.com> wrote:
> How big are the initial CSV files?
>
> If they are large, like millions of lines, or even hundreds of
> thousands, then it will be ideal if you can avoid the line-by-line
> split, and instead process the lines in place.

Not millions, but definitely ranging from tens to hundreds of thousands.

> This is one of the benefits of the record processors. For example,
> with UpdateRecord you can read in a large CSV line by line, apply an
> update to each line, and write it back out. So you only ever have one
> flow file.

Agreed.

> It sounds like you may have a significant amount of custom logic so
> you may need a custom processor,

Yes. Each record has its own logic. On top of that, sometimes multiple
data sources are consulted to determine the final value of the output
field.

> but you can still take this approach of reading a single flow file
> line by line, and writing out the results line by line (try to avoid
> reading the entire content into memory at one time).

That is what I am trying.

> On Thu, Nov 1, 2018 at 1:22 PM Ameer Mawia <ameer.ma...@gmail.com> wrote:
> >
> > Thanks for the input, folks.
> >
> > I had the impression that for the actual processing of the data:
> >
> > - We may have to put in place a custom processor which will have the
> >   transformation framework logic in it.
> > - Or we can use the ExecuteProcess processor to trigger an external
> >   process (which will run this transformation logic) and route the
> >   output back into NiFi.
> >
> > Our flow inside the framework generally looks like this:
> >
> > - Split the CSV file line by line.
> > - Split each line into an array of strings.
> > - For each record in the array, invoke its transformation method.
> > - The transformation method contains the transformation logic. This
> >   logic can be pretty intensive, e.g.:
> >   - searching for hundreds of different patterns.
> >   - lookups against hundreds of configured string constants.
> >   - appending/prepending/trimming/padding...
> > Finally, map each record into an output CSV format.
> >
> > So far we have been trying to see if SplitRecord, UpdateRecord,
> > ExtractText, etc. can come in handy.
> >
> > Thanks,
> >
> > On Thu, Nov 1, 2018 at 12:39 PM Mike Thomsen <mikerthom...@gmail.com> wrote:
> >>
> >> Ameer,
> >>
> >> Depending on how you implemented the custom framework, you may be
> >> able to drop it easily into a custom NiFi processor. Without knowing
> >> much about your implementation details, if you can act on Java
> >> streams, Strings, byte arrays and things like that, it will probably
> >> be very straightforward to drop in place.
> >>
> >> This is a really simple example of how you could bring it in,
> >> depending on how encapsulated your business logic is:
> >>
> >> @Override
> >> public void onTrigger(ProcessContext context, ProcessSession session)
> >>         throws ProcessException {
> >>     FlowFile input = session.get();
> >>     if (input == null) {
> >>         return;
> >>     }
> >>
> >>     FlowFile output = session.create(input);
> >>     try (InputStream is = session.read(input);
> >>          OutputStream os = session.write(output)) {
> >>         transformerPojo.transform(is, os);
> >>
> >>         is.close();
> >>         os.close();
> >>
> >>         session.transfer(input, REL_ORIGINAL); // if you created an "original" relationship
> >>         session.transfer(output, REL_SUCCESS);
> >>     } catch (Exception ex) {
> >>         session.remove(output);
> >>         session.transfer(input, REL_FAILURE);
> >>     }
> >> }
> >>
> >> That's the general idea, and that approach can scale to your disk
> >> space limits. Hope that helps put it into perspective.
> >>
> >> Mike
> >>
> >> On Thu, Nov 1, 2018 at 10:16 AM Nathan Gough <thena...@gmail.com> wrote:
> >>>
> >>> Hi Ameer,
> >>>
> >>> This blog by Mark Payne describes how to manipulate record-based
> >>> data like CSV using schemas:
> >>> https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi.
> >>> This would probably be the most efficient method.
> >>> And another here:
> >>> https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
> >>>
> >>> An alternative option would be to port your custom Java code into
> >>> your own NiFi processor:
> >>> https://medium.com/hashmapinc/creating-custom-processors-and-controllers-in-apache-nifi-e14148740ea
> >>> (under 'Steps for Creating a Custom Apache NiFi Processor')
> >>> https://nifi.apache.org/developer-guide.html
> >>>
> >>> Nathan
> >>>
> >>> On 10/31/18, 5:02 PM, "Ameer Mawia" <ameer.ma...@gmail.com> wrote:
> >>>
> >>>     We have a use case where we take data from a source (text data
> >>>     in CSV format), do transformation and manipulation of textual
> >>>     records, and output the data in another (CSV) format. This is
> >>>     done by a Java-based custom framework, written specifically for
> >>>     this *transformation* piece.
> >>>
> >>>     Recently, as Apache NiFi is being adopted at the enterprise
> >>>     level by the organisation, we have been asked to try *Apache
> >>>     NiFi* and see if we can use it as a replacement for this
> >>>     custom tool.
> >>>
> >>>     *My question is*:
> >>>
> >>>     - How much leverage does *Apache NiFi* provide for flowfile
> >>>       *content* manipulation?
> >>>
> >>>     I understand *NiFi* is good for creating data flow pipelines,
> >>>     but is it good for *extensive TEXT transformation* as well? So
> >>>     far I have not found an obvious way to achieve that.
> >>>
> >>>     Appreciate the feedback.
> >>>
> >>>     Thanks,
> >>>
> >>>     --
> >>>     http://ca.linkedin.com/in/ameermawia
> >>>     Toronto, ON
> >
> > --
> > http://ca.linkedin.com/in/ameermawia
> > Toronto, ON

--
http://ca.linkedin.com/in/ameermawia
Toronto, ON
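[Editor's addendum] Bryan's advice (read one line at a time, write the
result straight back out, never hold the whole CSV in memory) and Mike's
onTrigger() skeleton combine into the following plain-Java sketch of the
streaming pattern. The NiFi-specific pieces (session.read/session.write)
are replaced by ordinary java.io streams so it runs outside NiFi, and
transformLine() is a hypothetical stand-in for the custom per-record
logic, not anyone's actual framework code.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class StreamingLineTransform {

    // Hypothetical placeholder for the per-record transformation logic
    // (pattern matching, constant lookups, padding, etc. would go here).
    static String transformLine(String line) {
        return line.trim().toUpperCase();
    }

    // Streams the input to the output one line at a time, so memory use
    // is bounded by the longest line, not the size of the file. In a
    // processor, 'is' and 'os' would come from session.read/session.write.
    static void transform(InputStream is, OutputStream os) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8));
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(os, StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) { // one line in memory at a time
            writer.write(transformLine(line));
            writer.write('\n');
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        byte[] csv = " a,b \nc,d\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transform(new ByteArrayInputStream(csv), out);
        System.out.print(out.toString("UTF-8")); // prints "A,B" then "C,D"
    }
}
```

Dropping this body into Mike's try-with-resources block (with
transformerPojo.transform replaced by the loop) keeps the single-flow-file
approach while avoiding the SplitText-style line-by-line split.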