Shawn,

Please see the Jira that I referenced below. It explains how to determine whether or not a field exists in the record.
Thanks
-Mark

> On Jan 7, 2020, at 3:38 PM, Shawn Weeks <[email protected]> wrote:
>
> Seems like all you need is for the getValue method on the Record interface to
> return something different if the field doesn't exist, instead of null. Maybe
> an enum? Then you could correctly determine that a nullable field didn't
> exist vs. was just null in ValidateRecord.
>
> On 1/7/20, 2:03 PM, "Shawn Weeks" <[email protected]> wrote:
>
> So I'm playing around with it some more, and ValidateRecord is kinda
> working, but it's finding stuff valid that isn't. Shouldn't some of those
> records have been rejected? I realized that by writing out to Avro it was
> stubbing out the missing fields, but the actual input didn't have them.
>
> This is the input:
> c1,c2,c3
> hello,world,1
> hello,world
> hello
> hello,,
>
> The schema:
> {
>   "type" : "record",
>   "namespace" : "nifi",
>   "name" : "nifi",
>   "fields" : [
>     { "name" : "c1" , "type" : ["null","string"] },
>     { "name" : "c2" , "type" : ["null","string"] },
>     { "name" : "c3" , "type" : ["null","string"] }
>   ]
> }
>
> The output in JSON:
> [ {
>   "c1" : "hello",
>   "c2" : "world",
>   "c3" : "1"
> }, {
>   "c1" : "hello",
>   "c2" : "world"
> }, {
>   "c1" : "hello"
> }, {
>   "c1" : "hello",
>   "c2" : null,
>   "c3" : null
> } ]
>
> On 1/7/20, 1:14 PM, "Shawn Weeks" <[email protected]> wrote:
>
> I think to do that we'll need a way to distinguish between fields that
> were there and blank and fields that were never populated. Maybe an extension
> of the Record API itself to distinguish between null and not there. Or
> allowing the CSV Reader to redefine null to something like empty string.
>
> I welcome any ideas.
>
> Thanks
> Shawn
>
> On 1/7/20, 1:07 PM, "Matt Burgess" <[email protected]> wrote:
>
> I believe ValidateRecord would route the flow file to failure in that
> case. If you know the kind of data going in, then you might just be
> able to treat the failure relationship more like the invalid
> relationship. However, I like the idea of doing it in ValidateRecord as
> Mark proposed in the Jira.
>
> On Tue, Jan 7, 2020 at 2:02 PM Shawn Weeks <[email protected]> wrote:
>>
>> I see where I think we could do this in the CSV Reader itself. Jackson
>> actually has an option to fail on missing fields, and for Commons CSV we
>> know how many fields we have and how many are in a record. Maybe just add a
>> couple of true/false properties to fail on too few or too many fields, like
>> we do with PutDatabaseRecord. Not sure how to integrate that into
>> ValidateRecord, though, as by that point it's already parsed.
>>
>> Thanks
>> Shawn
>>
>> From: Mark Payne <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 7, 2020 at 12:52 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>>
>> I do agree that this is something that is not easily done with
>> ValidateRecord (and probably cannot be done with ValidateRecord). But it's
>> something that I think ValidateRecord *should* allow for. To that end, I've
>> created a JIRA [1] to track this.
>>
>> Thanks
>> -Mark
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-6986
>>
>> On Jan 7, 2020, at 1:24 PM, Shawn Weeks <[email protected]> wrote:
>>
>> I've been playing around with it, but I'm not sure how to do the kind of
>> validation I need. Consider this CSV. How would I validate this with
>> ValidateCsv?
>>
>> Good CSV
>> c1,c2,c3
>> hello,world,1
>> hello,world,
>> hello,,
>>
>> Bad CSV
>> c1,c2,c3
>> hello,world,1
>> hello,world
>> hello
>>
>> From: Emanuel Oliveira <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 7, 2020 at 12:21 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>>
>> ValidateCsv is the most robust (it handles missing fields as you need). It
>> doesn't use Avro schemas; instead it uses an inline sequence of functions to
>> accomplish anything you want (nulls OK or not, types, regex, etc.).
>>
>> In a recent project, while struggling for maximum data quality, I tried all
>> the different processors and options, and ValidateCsv is the clear winner
>> for CSVs.
>>
>> Emanuel O.
>>
>> On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> wrote:
>>
>> What about ValidateCsv, could that do what you want?
>>
>> Sent from my iPhone
>>
>> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
>>
>> I'm poking around to see if I can make the CSV parsers fail on a schema
>> mismatch like that. A stream command would be a good option though.
>>
>> Thanks
>> Shawn
>>
>> From: Mike Thomsen <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, January 6, 2020 at 4:35 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>>
>> We have a lot of the same issues where I work, and our solution is to use
>> ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin
>> line by line to check whether the export is screwed up. Some of our
>> sources are good and we don't have to do that, but others are minefields in
>> terms of the quality of the upstream data source, and that's the only way
>> we've found where we can predictably handle such things.
>>
>> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
>>
>> That's the challenge: the values can be null, but I want to know when the
>> fields are missing (aka not enough delimiters). I run into a common scenario
>> where line feeds end up in the data, making a short row. Currently the
>> reader just ignores the fact that there aren't enough delimiters and makes
>> them null.
>>
>> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
>>
>> Shawn,
>>
>> Your schema indicates that the fields are optional because of the
>> "type" : ["null", "string"], so IIRC they won't be marked as invalid
>> because they are treated as null (I'm not sure there's a difference in
>> the code between missing and null fields).
>>
>> You can try "type" : "string" in ValidateRecord to see if that fixes
>> it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
>>
>> Regards,
>> Matt
>>
>> On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
>>>
>>> I'm trying to validate that a CSV file has the number of fields defined in
>>> its Avro schema. Consider the following schema and CSVs. I would like to
>>> be able to reject the invalid CSV as missing fields.
>>>
>>> {
>>>   "type" : "record",
>>>   "namespace" : "nifi",
>>>   "name" : "nifi",
>>>   "fields" : [
>>>     { "name" : "c1" , "type" : ["null", "string"] },
>>>     { "name" : "c2" , "type" : ["null", "string"] },
>>>     { "name" : "c3" , "type" : ["null", "string"] }
>>>   ]
>>> }
>>>
>>> Good CSV
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world,
>>> hello,,
>>>
>>> Bad CSV
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world
>>> hello
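
Two quick sketches of the ideas discussed above follow. First, the field-count check Shawn describes: a minimal, standalone Java sketch that assumes Apache Commons CSV is on the classpath (one of the parsers mentioned above). It is not a NiFi processor and not the NIFI-6986 change, just the bare comparison of each row's size against the header.

import java.io.Reader;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Standalone sketch, not a NiFi processor: compare each row's field count
// against the header, i.e. the "fail on too few / too many fields" check
// discussed in the thread. Assumes Apache Commons CSV on the classpath.
public class FieldCountCheck {

    public static void main(String[] args) throws Exception {
        // The "Bad CSV" from the thread: the last two rows are short.
        String badCsv = "c1,c2,c3\n"
                + "hello,world,1\n"
                + "hello,world\n"
                + "hello\n";

        try (Reader in = new StringReader(badCsv);
             CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {

            int expected = parser.getHeaderMap().size();   // 3 columns declared
            for (CSVRecord row : parser) {
                if (row.size() != expected) {
                    // "hello,world" and "hello" land here; a reader that pads
                    // missing columns with null would have hidden this.
                    System.out.println("Invalid row " + row.getRecordNumber()
                            + ": expected " + expected + " fields, got " + row.size());
                }
            }
        }
    }
}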

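Second, the null-vs-missing distinction the thread keeps circling. The Presence and FieldValue names below are hypothetical and are not the NiFi Record API; this only illustrates the enum idea Shawn floats at the top of the thread, whose real resolution is tracked in NIFI-6986.

import java.util.HashMap;
import java.util.Map;

// Illustration only: Presence, FieldValue, and getValue are hypothetical
// names, NOT the NiFi Record API. They show how a lookup that reports
// "absent" separately from null would let ValidateRecord tell a short row
// apart from an explicit null.
public class MissingVsNullDemo {

    enum Presence { PRESENT, ABSENT }

    static final class FieldValue {
        final Object value;
        final Presence presence;

        FieldValue(Object value, Presence presence) {
            this.value = value;
            this.presence = presence;
        }
    }

    // Returns the value plus whether the field existed at all.
    static FieldValue getValue(Map<String, Object> record, String field) {
        if (!record.containsKey(field)) {
            return new FieldValue(null, Presence.ABSENT);
        }
        return new FieldValue(record.get(field), Presence.PRESENT);
    }

    public static void main(String[] args) {
        // "hello,," -> c2 and c3 are present but empty/null
        Map<String, Object> explicitNulls = new HashMap<>();
        explicitNulls.put("c1", "hello");
        explicitNulls.put("c2", null);
        explicitNulls.put("c3", null);

        // "hello" -> c2 and c3 were never populated (short row)
        Map<String, Object> shortRow = new HashMap<>();
        shortRow.put("c1", "hello");

        System.out.println(getValue(explicitNulls, "c3").presence); // PRESENT
        System.out.println(getValue(shortRow, "c3").presence);      // ABSENT
    }
}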