Re: Delimiter splitting in ExtractText possible?

Joe Witt Wed, 23 Nov 2016 07:27:40 -0800

Jason

That was an excellent response.


Prabhu - I think the question is what would you like to do with the
data?  Are you going to transform it then send it somewhere?  Do you
want to be able to filter some rows out then send the rest?  Can you
describe that part more?

It is certainly easy enough to do the two-phase split to maintain efficiency.
The general pattern here is:

SplitText (500 line chunks for example)
SplitText (single line chunks)
?? - what do you want to accomplish per line?
?? - where is the data going?
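
Outside NiFi, the two SplitText steps above can be sketched roughly like this. This is plain Python for illustration only, not NiFi code; the helper names split_chunks and split_lines and the generated sample rows are made up. The point is that the first pass produces a small number of chunks, and only then is each chunk broken into single lines.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Phase 1: group lines into chunks of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_lines(chunks: Iterable[List[str]]) -> Iterator[str]:
    """Phase 2: break each chunk into single lines."""
    for chunk in chunks:
        yield from chunk


# Hypothetical input: 1200 generated rows standing in for a CSV file.
rows = [f"{i},name{i}" for i in range(1, 1201)]

chunks = list(split_chunks(rows, 500))
print([len(c) for c in chunks])  # [500, 500, 200]

singles = list(split_lines(chunks))
print(len(singles))  # 1200
```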

Thanks
Joe

On Wed, Nov 23, 2016 at 9:41 AM, Jason Tarasovic
<[email protected]> wrote:
> Prabhu,
>
> It's possible to do what you're asking, but it's not especially efficient.
> You can run SplitText twice (10,000 lines and then 1), outputting the header
> on each, and then run the result through ExtractText. Your regex would be
> something like ([^,]*?),([^,]*),.... so each group matches 0 or more
> non-comma characters followed by a comma. ExtractText will place the matched
> capture groups into attributes like you mentioned
> (data.1->the_captured_text).
>
> However, it's not super efficient, or at least it hasn't been in my case:
> you're moving the FlowFile contents into attributes, and attributes are
> stored in memory. Depending on how large the file is, you *may* experience
> excessive GC activity or OOM errors.
>
> Using InferAvroSchema (if you don't know the schema in advance) and then
> using ConvertCSVtoAvro may be a better option depending on where the data is
> ultimately going. One caveat though is that ConvertCSVtoAvro seems to only
> work with properly quoted and escaped CSV that conforms to RFC 4180.
>
> I'm just getting started with NiFi myself so not an expert or anything but I
> hope that helps.
>
> -Jason
>
> On Tue, Nov 22, 2016 at 3:34 AM, prabhu Mahendran <[email protected]>
> wrote:
>>
>> Hi All,
>>
>> I have unstructured CSV data, with comma as the delimiter, which contains
>> 100 rows.
>>
>> Is it possible to extract the data in the CSV file using comma as the
>> separator in NiFi processors?
>>
>>
>> Here are 3 sample rows out of the 100:
>>
>> No,Name,Age,PAN,City
>> 1,Siva,22,91230,Londan,
>> 2,,23,91231,UK
>> 3,Greck,22,,US
>>
>>
>> The 1st row has all values, so it can be separated by a "data" attribute
>> with the regex (.+),(.+),(.+),(.+),(.+) and the row is split like below:
>>
>>                 data.1-->1
>>                 data.2-->Siva
>>                 data.3-->22
>>                 data.4-->91230
>>                 data.5-->Londan
>>
>> But the second row has an empty value in the Name column, so it needs the
>> regex (.+),,(.+),(.+),(.+) and the row is split like below:
>>
>>                data.1-->2
>>                data.2-->23
>>                data.3-->91231
>>                data.4-->UK
>>
>> The third row is similar: the PAN column is empty, so it can be split using
>> another regex attribute.
>>
>> But my problem is that the data now has 100 rows, and in future it may have
>> another 100 rows, so I would again need to write more regex attributes to
>> capture each group.
>>
>>
>> So I thought that if I gave comma (,) as a common regex for all rows in the
>> CSV file, it would split the data into data.1,data.2,...data.5
>>
>> But I get a validation failed error in the Bulletins Indicator of the
>> ExtractText processor.
>>
>> So is it possible to split the rows of a CSV file by delimiter?
>>
>> Is it possible to write a common regex for all CSV data in ExtractText or
>> any other processor?
>>
>
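
For the single-regex idea discussed in the quoted messages, a minimal sketch is shown here in Python rather than in ExtractText itself. It assumes exactly five fields per row, allows any field to be empty, and tolerates one trailing comma as in the first sample row. The name extract_row is hypothetical, and the data.N keys only mimic the attribute naming from the original question. ExtractText uses Java regular expressions, so treat this as an illustration of the pattern and not as processor configuration.

```python
import re
from typing import Dict, Optional

# Five comma-separated fields; each field may be empty.
# The optional trailing comma covers rows like the first sample row.
ROW_PATTERN = re.compile(r"([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),?")


def extract_row(line: str) -> Optional[Dict[str, str]]:
    """Return data.1..data.5 for a five-field row, or None if it does not match."""
    match = ROW_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return {f"data.{i}": value for i, value in enumerate(match.groups(), start=1)}


# Sample rows copied from the original question.
sample = [
    "1,Siva,22,91230,Londan,",
    "2,,23,91231,UK",
    "3,Greck,22,,US",
]
for row in sample:
    print(extract_row(row))
```

Because empty fields still produce an (empty) capture group, the values keep their column positions, so one pattern covers all three sample rows. It does not handle quoted fields that contain commas; for that kind of data a real CSV parser is the safer choice.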
