How big are the initial CSV files?

If they are large, say hundreds of thousands or even millions of
lines, then ideally you should avoid splitting them into one flow
file per line, and instead process the lines in place.

This is one of the benefits of the record processors. For example,
with UpdateRecord you can read in a large CSV line by line, apply an
update to each line, and write it back out. So you only ever have one
flow file.
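
As a concrete illustration, assuming a CSVReader and CSVRecordSetWriter
are configured and the CSV has a column named "name" (both the column
and the expression here are just placeholders for whatever update you
need), a single UpdateRecord with one dynamic property can rewrite that
field on every row without ever splitting the file:

    Replacement Value Strategy: Literal Value
    /name : ${field.value:toUpper()}

The property name is a RecordPath and the value is evaluated per field,
so this uppercases the "name" column in place.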

It sounds like you may have a significant amount of custom logic, so
you may need a custom processor, but you can still take this approach
of reading a single flow file line by line and writing the results out
line by line (try to avoid reading the entire content into memory at
one time).
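
A minimal sketch of that streaming pattern inside a custom processor's
onTrigger, where transformLine(...) is a hypothetical stand-in for your
own per-line logic:

// StreamCallback is org.apache.nifi.processor.io.StreamCallback;
// the readers/writers are the usual java.io classes.
flowFile = session.write(flowFile, new StreamCallback() {
    @Override
    public void process(InputStream in, OutputStream out) throws IOException {
        // Stream the content so only one line is in memory at a time
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transformLine(line)); // your custom per-line logic
                writer.newLine();
            }
        }
    }
});
session.transfer(flowFile, REL_SUCCESS);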


On Thu, Nov 1, 2018 at 1:22 PM Ameer Mawia <ameer.ma...@gmail.com> wrote:
>
> Thanks for the input folks.
>
> I had the impression that for the actual processing of the data:
>
> We may have to put in place a custom processor which will contain the
> transformation framework logic.
> Or we can use the ExecuteProcess processor to trigger an external process
> (which will run this transformation logic) and route the output back into NiFi.
>
> Our flow inside the framework generally looks like this:
>
> Split the CSV file line by line.
> For each line, split it into an array of strings.
> For each record in the array, determine and invoke its transformation method.
> The transformation method contains the transformation logic, which can be
> pretty intensive, for example:
>
> searching for hundreds of different patterns,
> lookups against hundreds of configured string constants,
> appending/prepending/trimming/padding...
>
> Finally, map each record into an output CSV format.
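>
> In code terms, each line goes through roughly this (a simplified sketch;
> FieldTransformer and transformers are placeholders for our framework's
> per-column transformation lookup):
>
> String transformLine(String line, List<FieldTransformer> transformers) {
>     String[] fields = line.split(",", -1);   // naive split; real code handles quoting
>     StringBuilder out = new StringBuilder();
>     for (int i = 0; i < fields.length; i++) {
>         // pattern searches, constant lookups, append/prepend/trim/pad...
>         String value = transformers.get(i).transform(fields[i]);
>         if (i > 0) out.append(',');
>         out.append(value);                   // map into the output CSV format
>     }
>     return out.toString();
> }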
>
> So far we have been trying to see if SplitRecord, UpdateRecord, ExtractText,
> etc. can come in handy.
>
> Thanks,
>
> On Thu, Nov 1, 2018 at 12:39 PM Mike Thomsen <mikerthom...@gmail.com> wrote:
>>
>> Ameer,
>>
>> Depending on how you implemented the custom framework, you may be able to
>> drop it straight into a custom NiFi processor. Without knowing much about
>> your implementation details, if you can act on Java streams, Strings, byte
>> arrays and things like that, it will probably be very straightforward to
>> drop in place.
>>
>> This is a really simple example of how you could bring it in, depending on
>> how encapsulated your business logic is:
>>
>> @Override
>> public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
>>     FlowFile input = session.get();
>>     if (input == null) {
>>         return;
>>     }
>>
>>     FlowFile output = session.create(input);
>>     try (InputStream is = session.read(input);
>>          OutputStream os = session.write(output)
>>     ) {
>>         transformerPojo.transform(is, os);
>>
>>         // Close the streams before transferring so the session sees the final content
>>         is.close();
>>         os.close();
>>
>>         session.transfer(input, REL_ORIGINAL); // If you created an "original" relationship
>>         session.transfer(output, REL_SUCCESS);
>>     } catch (Exception ex) {
>>         session.remove(output);
>>         session.transfer(input, REL_FAILURE);
>>     }
>> }
>>
>> That's the general idea, and that approach can scale to your disk space 
>> limits. Hope that helps put it into perspective.
>>
>> Mike
>>
>> On Thu, Nov 1, 2018 at 10:16 AM Nathan Gough <thena...@gmail.com> wrote:
>>>
>>> Hi Ameer,
>>>
>>> This blog by Mark Payne describes how to manipulate record-based data like
>>> CSV using schemas:
>>> https://blogs.apache.org/nifi/entry/record-oriented-data-with-nifi
>>> This would probably be the most efficient method. There is another here:
>>> https://bryanbende.com/development/2017/06/20/apache-nifi-records-and-schema-registries
>>>
>>> An alternative option would be to port your custom Java code into your own
>>> NiFi processor:
>>> https://medium.com/hashmapinc/creating-custom-processors-and-controllers-in-apache-nifi-e14148740ea
>>> (under 'Steps for Creating a Custom Apache NiFi Processor')
>>> https://nifi.apache.org/developer-guide.html
>>>
>>> Nathan
>>>
>>> On 10/31/18, 5:02 PM, "Ameer Mawia" <ameer.ma...@gmail.com> wrote:
>>>
>>>     We have a use case where we take data from a source (text data in CSV
>>>     format), do transformation and manipulation of the textual records, and
>>>     output the data in another (CSV) format. This is being done by a
>>>     Java-based custom framework, written specifically for this
>>>     *transformation* piece.
>>>
>>>     Recently, as Apache NIFI is being adopted at the enterprise level by the
>>>     organisation, we have been asked to try *Apache NIFI* and see if we can
>>>     use it as a replacement for this custom tool.
>>>
>>>     *My question is*:
>>>
>>>        - How much leverage does *Apache NIFI* provide for flowfile *content*
>>>        manipulation?
>>>
>>>     I understand *NIFI* is good for creating data flow pipelines, but is it
>>>     good for *extensive TEXT transformation* as well? So far I have not
>>>     found an obvious way to achieve that.
>>>
>>>     Appreciate the feedback.
>>>
>>>     Thanks,
>>>
>>>     --
>>>     http://ca.linkedin.com/in/ameermawia
>>>     Toronto, ON
>>>
>>>
>>>
>
>
> --
> http://ca.linkedin.com/in/ameermawia
> Toronto, ON
>
