I agree that this is not easily done with ValidateRecord, and probably cannot 
be done with it at all. But it's something that I think ValidateRecord 
*should* allow for. To that end, I've created a JIRA [1] to track this.

Thanks
-Mark

[1] https://issues.apache.org/jira/browse/NIFI-6986
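
Until ValidateRecord can do this, the ExecuteStreamCommand approach Mike
describes further down the thread is probably the practical workaround. A
minimal sketch of such a script in Python follows. It is not Mike's actual
script, and the names find_short_rows and main are made up for illustration.
It assumes a comma delimiter and that the first line is the header, and it
treats any row whose field count differs from the header's as bad:

```python
import csv
import sys


def find_short_rows(lines, delimiter=","):
    """Return (line_number, field_count) for every row whose field count
    differs from the header row's field count.

    `lines` is any iterable of text lines (an open file, sys.stdin, a list).
    Hypothetical helper for illustration, not part of NiFi.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        # Empty input: nothing to compare against.
        return []
    expected = len(header)
    problems = []
    for row in reader:
        # An empty value ("hello,,") still counts as a field; a missing
        # delimiter ("hello") does not, so the row comes up short.
        if len(row) != expected:
            problems.append((reader.line_num, len(row)))
    return problems


def main(stream=sys.stdin):
    """Report mismatched rows on stderr; return 1 if any were found."""
    problems = find_short_rows(stream)
    for line_num, count in problems:
        print(f"line {line_num}: found {count} field(s)", file=sys.stderr)
    return 1 if problems else 0
```

Against the two samples from Shawn's message, the good CSV yields no
problems, while the bad CSV is flagged at lines 3 and 4 (2 fields and 1
field). To use it as a script, end the file with sys.exit(main()) so that a
non-zero exit status signals a malformed file. How the flow then routes on
that status depends on the NiFi setup and is not shown here.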

> On Jan 7, 2020, at 1:24 PM, Shawn Weeks <[email protected]> wrote:
> 
> I’ve been playing around with it, but I’m not sure how to do the kind of 
> validation I need. Consider this CSV. How would I validate this with 
> ValidateCsv?
>  
> Good CSV
> c1,c2,c3
> hello,world,1
> hello,world,
> hello,,
>  
> Bad CSV
> c1,c2,c3
> hello,world,1
> hello,world
> hello
>  
> From: Emanuel Oliveira <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, January 7, 2020 at 12:21 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Validating CSV File
>  
> ValidateCsv is the most robust (it handles missing fields as you need). It 
> doesn't use Avro schemas; instead it uses an inline sequence of functions to 
> accomplish anything you want (nulls OK or not, types, regex, etc.).
>  
> In a recent project, while struggling for maximum data quality, I tried all 
> the different processors and options, and ValidateCsv is the clear winner 
> for CSVs.
>  
> Emanuel O.
>  
> On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> wrote:
> What about ValidateCsv, could that do what you want?
> 
> Sent from my iPhone
> 
> 
> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
> 
> I’m poking around to see if I can make the csv parsers fail on a schema 
> mismatch like that. A stream command would be a good option though.
>  
> Thanks
> Shawn
>  
> From: Mike Thomsen <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, January 6, 2020 at 4:35 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Validating CSV File
>  
> We have a lot of the same issues where I work, and our solution is to use 
> ExecuteStreamCommand to pass CSVs off to Python scripts that will read stdin 
> line by line to check to see if the export isn't screwed up. Some of our 
> sources are good and we don't have to do that, but others are minefields in 
> terms of the quality of the upstream data source, and that's the only way 
> we've found where we can predictably handle such things.
>  
> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
> That's the challenge: the values can be null, but I want to know when the 
> fields are missing (i.e. not enough delimiters). I run into a common scenario 
> where line feeds end up in the data, making a short row. Currently the reader 
> just ignores the fact that there aren't enough delimiters and makes them null.
> 
> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
> 
>     Shawn,
> 
>     Your schema indicates that the fields are optional because of the
>     "type" :  ["null", "string"] , so IIRC they won't be marked as invalid
>     because they are treated as null (I'm not sure there's a difference in
>     the code between missing and null fields).
> 
>     You can try "type": "string" in ValidateRecord to see if that fixes
>     it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
> 
>     Regards,
>     Matt
> 
>     On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
>     >
>     > I’m trying to validate that a CSV file has the number of fields defined
>     > in its Avro schema. Consider the following schema and CSVs. I would like
>     > to be able to reject the invalid CSV as missing fields.
>     >
>     > {
>     >    "type" : "record",
>     >    "namespace" : "nifi",
>     >    "name" : "nifi",
>     >    "fields" : [
>     >       { "name" : "c1" , "type" : ["null", "string"] },
>     >       { "name" : "c2" , "type" : ["null", "string"] },
>     >       { "name" : "c3" , "type" : ["null", "string"] }
>     >    ]
>     > }
>     >
>     > Good CSV
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world,
>     > hello,,
>     >
>     > Bad CSV
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world
>     > hello
>     >
> 
