What about ValidateCsv, could that do what you want?

> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
> 
> 
> I’m poking around to see if I can make the csv parsers fail on a schema 
> mismatch like that. A stream command would be a good option though.
>  
> Thanks
> Shawn
>  
> From: Mike Thomsen <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Monday, January 6, 2020 at 4:35 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Validating CSV File
>  
> We have a lot of the same issues where I work, and our solution is to use 
> ExecuteStreamCommand to pass CSVs off to Python scripts that will read stdin 
> line by line to check to see if the export isn't screwed up. Some of our 
> sources are good and we don't have to do that, but others are minefields in 
> terms of the quality of the upstream data source, and that's the only way 
> we've found where we can predictably handle such things.
>  
> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
> That's the challenge: the values can be null, but I want to know when the 
> fields are missing (i.e. not enough delimiters). I run into a common scenario 
> where line feeds end up in the data, making a short row. Currently the reader 
> just ignores the fact that there aren't enough delimiters and makes them null.
> 
> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
> 
>     Shawn,
> 
>     Your schema indicates that the fields are optional because of the
>     "type" :  ["null", "string"] , so IIRC they won't be marked as invalid
>     because they are treated as null (I'm not sure there's a difference in
>     the code between missing and null fields).
> 
>     You can try "type": "string" in ValidateRecord to see if that fixes
>     it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
> 
>     Regards,
>     Matt
> 
>     On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
>     >
>     > I’m trying to validate that a CSV file has the number of fields defined
>     > in its Avro schema. Consider the following schema and CSVs. I would like
>     > to be able to reject the invalid CSV as missing fields.
>     >
>     > {
>     >    "type" : "record",
>     >    "namespace" : "nifi",
>     >    "name" : "nifi",
>     >    "fields" : [
>     >       { "name" : "c1" , "type" : ["null", "string"] },
>     >       { "name" : "c2" , "type" : ["null", "string"] },
>     >       { "name" : "c3" , "type" : ["null", "string"] }
>     >    ]
>     > }
>     >
>     > Good CSV
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world,
>     > hello,,
>     >
>     > Bad CSV
>     > c1,c2,c3
>     > hello,world,1
>     > hello,world
>     > hello
>     >
> 
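
For reference, the stdin check suggested earlier in the thread for ExecuteStreamCommand might look roughly like the sketch below. It is not anyone's actual script from this thread. It assumes the first line is a header that defines the expected field count, that the delimiter is a comma, and that a non-zero exit code is how the flow learns the file is bad. The names find_short_rows and main are made up for illustration.

```python
import csv
import sys


def find_short_rows(lines, delimiter=","):
    """Return (line_number, field_count) for every record whose field
    count differs from the header's. `lines` is any iterable of text
    lines, e.g. sys.stdin."""
    reader = csv.reader(lines, delimiter=delimiter)
    try:
        header = next(reader)
    except StopIteration:
        return []  # empty input, nothing to check
    expected = len(header)
    bad = []
    for row in reader:
        # An empty trailing value ("hello,,") still counts as a field,
        # so only rows with the wrong number of delimiters are flagged.
        if len(row) != expected:
            # reader.line_num is the physical line just consumed
            bad.append((reader.line_num, len(row)))
    return bad


def main(stream=sys.stdin):
    """Report mismatched rows on stderr; return 1 if any were found."""
    bad = find_short_rows(stream)
    for line_number, count in bad:
        print(f"line {line_number}: found {count} field(s)", file=sys.stderr)
    return 1 if bad else 0


# To use this as a standalone script (assumption: the caller routes on
# the exit code), end the file with:
#     sys.exit(main())
```

Run against the "Bad CSV" above, this flags lines 3 and 4; the "Good CSV" passes because trailing empty values still carry their delimiters. Note that csv.reader treats a line feed inside a quoted value as part of that value, so only unquoted stray line feeds show up as short rows.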
