I’ve been playing around with it, but I’m not sure how to do the kind of
validation I need. Consider these CSVs. How would I validate them with
ValidateCsv?

Good CSV
c1,c2,c3
hello,world,1
hello,world,
hello,,

Bad CSV
c1,c2,c3
hello,world,1
hello,world
hello

From: Emanuel Oliveira <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, January 7, 2020 at 12:21 PM
To: "[email protected]" <[email protected]>
Subject: Re: Validating CSV File

ValidateCsv is the most robust option (it handles missing fields the way you
need). It doesn't use Avro schemas; instead you supply an inline sequence of
functions that can express whatever you want (nulls OK or not, types, regex, etc.).
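For example, to require exactly three columns while still allowing nulls, the
Schema property can be as simple as (syntax from memory, so double-check the
ValidateCsv docs):

    Null, Null, Null

Each entry is a cell processor for one column, so a row with fewer (or more)
fields than processors is routed to invalid even though every value is
allowed to be null.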

In a recent project, while striving for maximum data quality, I tried all the
different processors and options, and ValidateCsv is the clear winner for CSVs.

Emanuel O.

On Mon 6 Jan 2020, 23:36 Matt Burgess <[email protected]> wrote:
What about ValidateCsv, could that do what you want?
Sent from my iPhone


On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
I’m poking around to see if I can make the csv parsers fail on a schema 
mismatch like that. A stream command would be a good option though.

Thanks
Shawn

From: Mike Thomsen <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, January 6, 2020 at 4:35 PM
To: "[email protected]" <[email protected]>
Subject: Re: Validating CSV File

We have a lot of the same issues where I work, and our solution is to use
ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin line
by line and check whether the export is malformed. Some of our sources are
good and we don't have to do that, but others are minefields in terms of
upstream data quality, and that's the only way we've found to handle such
things predictably.
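A stripped-down sketch of one of those scripts (not our production code; it
assumes comma-delimited input with a header row):

    #!/usr/bin/env python3
    # Read the CSV from stdin (ExecuteStreamCommand pipes the flowfile
    # content there) and exit non-zero if any record's field count doesn't
    # match the header. csv.reader keeps quoted fields with embedded commas
    # or line feeds intact, so only genuinely short/long records fail.
    import csv
    import sys

    reader = csv.reader(sys.stdin)
    header = next(reader, None)
    if header is None:
        sys.exit(0)  # empty input, nothing to validate
    bad = [n for n, row in enumerate(reader, start=2)
           if len(row) != len(header)]
    if bad:
        sys.stderr.write("wrong field count in records: %s\n" % bad)
        sys.exit(1)  # NiFi surfaces this in the execution.status attribute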

On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
That's the challenge: the values can be null, but I want to know when fields
are missing (i.e., not enough delimiters). I run into a common scenario where
line feeds end up in the data, producing a short row. Currently the reader
just ignores the fact that there aren't enough delimiters and makes the
missing fields null.
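For example (made-up record), an unescaped line feed in the middle of c2
splits one record into two short rows:

    hello,wor
    ld,1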

On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:

    Shawn,

    Your schema indicates that the fields are optional because of the
    "type" :  ["null", "string"] , so IIRC they won't be marked as invalid
    because they are treated as null (I'm not sure there's a difference in
    the code between missing and null fields).

    You can try "type": "string" in ValidateRecord to see if that fixes
    it, or there's a "StrNotNullOrEmpty" operator in ValidateCSV.
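
    For example, making c1 required in the Avro schema:

       { "name" : "c1" , "type" : "string" },

    or, in ValidateCsv's Schema property (syntax from memory, check the
    processor docs):

       StrNotNullOrEmpty(), StrNotNullOrEmpty(), StrNotNullOrEmpty()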

    Regards,
    Matt

    On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
    > I’m trying to validate that a CSV file has the number of fields defined
    > in its Avro schema. Consider the following schema and CSVs. I would like
    > to be able to reject the invalid CSV as missing fields.
    >
    > {
    >    "type" : "record",
    >    "namespace" : "nifi",
    >    "name" : "nifi",
    >    "fields" : [
    >       { "name" : "c1" , "type" : ["null", "string"] },
    >       { "name" : "c2" , "type" : ["null", "string"] },
    >       { "name" : "c3" , "type" : ["null", "string"] }
    >    ]
    > }
    >
    > Good CSV
    > c1,c2,c3
    > hello,world,1
    > hello,world,
    > hello,,
    >
    > Bad CSV
    > c1,c2,c3
    > hello,world,1
    > hello,world
    > hello
