Re: Delimiter splitting in ExtractText possible?

Jason Tarasovic Wed, 23 Nov 2016 06:41:52 -0800

Prabhu,

It's possible to do what you're asking, but it's not especially efficient.
You can run SplitText twice (a line split count of 10,000 and then 1),
outputting the header on each split, and then run the result through
ExtractText. Your regex would be something like ([^,]*?),([^,]*),....
so each group matches 0 or more non-comma characters and is followed by a
comma. ExtractText will place the matched capture groups into attributes
like you mentioned (data.1->the_captured_text).
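To illustrate the kind of pattern described above, here is a minimal
Python sketch run outside NiFi. ExtractText uses Java regular expressions,
but the constructs used here should behave the same way in both. The
five-column pattern and the data.N naming mirror the sample in the original
question; the names ROW_PATTERN and split_row are hypothetical, and the
optional trailing comma is an assumption made to cope with the first sample
row.

```python
import re

# One capture group per column. [^,]* matches zero or more non-comma
# characters, so empty fields still match. The trailing ,? tolerates a
# stray comma at the end of a row (as in the first sample row).
ROW_PATTERN = re.compile(r"^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),?$")

def split_row(row):
    """Return {'data.1': ..., ..., 'data.5': ...}, or None if the row does not match."""
    match = ROW_PATTERN.match(row)
    if match is None:
        return None
    return {f"data.{i}": value for i, value in enumerate(match.groups(), start=1)}

rows = ["1,Siva,22,91230,Londan,", "2,,23,91231,UK", "3,Greck,22,,US"]
for row in rows:
    print(split_row(row))
```

A single pattern like this covers rows with and without empty fields, so
there is no need for one regex per row shape.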

However, it's not super efficient, or at least it hasn't been in my case.
You're moving the FlowFile contents into attributes, and attributes are
stored in memory, so depending on how large the file is you *may*
experience excessive GC activity or OOM errors.

Using InferAvroSchema (if you don't know the schema in advance) and then
ConvertCSVtoAvro may be a better option, depending on where the data is
ultimately going. One caveat, though, is that ConvertCSVtoAvro seems to only
work with properly quoted and escaped CSV that conforms to RFC 4180.
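As a rough illustration of what "properly quoted and escaped" means, here
is a small Python sketch using the standard csv module (not NiFi, and not
ConvertCSVtoAvro itself). The sample text is made up for the example, and
parse_rows is a hypothetical helper name. In RFC 4180 style, a field that
contains a comma or a double quote is wrapped in double quotes, and an
embedded double quote is written twice.

```python
import csv
import io

# Hypothetical sample: the last row has a field containing a comma and
# embedded double quotes, wrapped in double quotes with the inner quotes
# doubled, as RFC 4180 describes.
SAMPLE = (
    'No,Name,Age,PAN,City\r\n'
    '1,Siva,22,91230,Londan\r\n'
    '2,,23,91231,"Leeds, ""UK"""\r\n'
)

def parse_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text, newline="")))

for record in parse_rows(SAMPLE):
    print(record)
```

Note that empty fields (the missing Name in the last row) come through as
empty strings rather than breaking the parse.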

I'm just getting started with NiFi myself, so I'm not an expert or
anything, but I hope that helps.

-Jason

On Tue, Nov 22, 2016 at 3:34 AM, prabhu Mahendran <[email protected]>
wrote:

> Hi All,
>
> I have unstructured CSV data, with a comma as the delimiter, which
> contains 100 rows.
>
> Is it possible to extract the data in the CSV file using a comma as the
> separator in NiFi processors?
>
>
> *Here is my sample data: 3 of the 100 rows.*
>
> *No,Name,Age,PAN,City*
> *1,Siva,22,91230,Londan,*
> *2,,23,91231,UK*
> *3,Greck,22,,US*
>
> The 1st row has all values, so it can be separated by a "data" attribute
> with the regex *(.+),(.+),(.+),(.+),(.+)*; the row is then split like
> below:
>
>                 data.1-->1
>                 data.2-->Siva
>                 data.3-->22
>                 data.4-->91230
>                 data.5-->Londan
>
> But the second row, which has an empty value in the Name column, can use
> the regex (.+),,(.+),(.+),(.+); the row is then split like below:
>
>                data.1-->2
>                data.2-->23
>                data.3-->91231
>                data.4-->UK
>
> The third row is similar: the PAN column is empty, so it can be split
> using another regex attribute.
>
> But my problem is that the data now has 100 rows. In future it may have
> another 100 rows, so I would again need to write more regex attributes to
> capture the groups.
>
>
> *So I thought that if I gave a comma (,) as a common regex for all rows in
> the CSV file, it would split the data into data.1, data.2, ... data.5.*
>
>
>
>
>
> *But I get a "validation failed" error in the Bulletins indicator of the
> ExtractText processor. So is it possible to do delimiter-wise splitting of
> rows in a CSV file? Is it possible to write a common regex for all CSV data
> in ExtractText or any other processor?*
>
>
