Just saw it. It looks pretty simple and you gave plenty of details on how to implement it. Thanks
Sent from my iPhone

> On Jan 7, 2020, at 2:56 PM, Mark Payne <[email protected]> wrote:
>
> Shawn,
>
> Please see the Jira that I referenced below. It explains how to determine
> whether or not a field exists in the record.
>
> Thanks
> -Mark
>
>> On Jan 7, 2020, at 3:38 PM, Shawn Weeks <[email protected]> wrote:
>>
>> Seems like all you need is for the getValue method on the Record interface
>> to return something different if the field doesn't exist, instead of null.
>> Maybe an enum? Then in ValidateRecord you could correctly determine that a
>> nullable field didn't exist vs. was just null.
>>
>> On 1/7/20, 2:03 PM, "Shawn Weeks" <[email protected]> wrote:
>>
>> So I'm playing around with it some more, and ValidateRecord is kind of
>> working, but it's marking records valid that aren't. Shouldn't some of
>> those records have been rejected? I realized that by writing out to Avro
>> it was stubbing out the missing fields, but the actual input didn't have
>> them.
>>
>> This is the input:
>>
>> c1,c2,c3
>> hello,world,1
>> hello,world
>> hello
>> hello,,
>>
>> The schema:
>>
>> {
>>   "type" : "record",
>>   "namespace" : "nifi",
>>   "name" : "nifi",
>>   "fields" : [
>>     { "name" : "c1" , "type" : ["null","string"] },
>>     { "name" : "c2" , "type" : ["null","string"] },
>>     { "name" : "c3" , "type" : ["null","string"] }
>>   ]
>> }
>>
>> The output in JSON:
>>
>> [ {
>>   "c1" : "hello",
>>   "c2" : "world",
>>   "c3" : "1"
>> }, {
>>   "c1" : "hello",
>>   "c2" : "world"
>> }, {
>>   "c1" : "hello"
>> }, {
>>   "c1" : "hello",
>>   "c2" : null,
>>   "c3" : null
>> } ]
>>
>> On 1/7/20, 1:14 PM, "Shawn Weeks" <[email protected]> wrote:
>>
>> I think to do that we'll need a way to distinguish between fields that
>> were present but blank and fields that were never populated. Maybe an
>> extension of the Record API itself to distinguish between null and not
>> there, or allowing the CSV reader to redefine null as something like the
>> empty string.
>>
>> I welcome any ideas.
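[Editorial aside: the missing-vs-null distinction discussed above can be illustrated outside NiFi with a short sketch using Python's standard csv module. This is only an analogy, not NiFi's actual CSV reader: the raw parse keeps each row's physical field count, so a field that was never present is distinguishable from one that is present but empty.]

```python
import csv
import io

data = "c1,c2,c3\nhello,world,1\nhello,world\nhello\nhello,,\n"

# csv.reader keeps the physical field count per row, so a short row
# (missing trailing fields) is detectable by length alone.
rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]
lengths = [len(r) for r in body]          # [3, 2, 1, 3]

# csv.DictReader fills fields that were never present with a sentinel
# (None by default), which stays distinct from the empty string that
# "hello,," produces -- the same null-vs-missing split discussed above.
dict_rows = list(csv.DictReader(io.StringIO(data)))
missing_c3 = dict_rows[1]["c3"] is None   # "hello,world": c3 was absent
empty_c3 = dict_rows[3]["c3"] == ""       # "hello,,": c3 present, blank
```

A Record API enum as Shawn suggests would play the same role as DictReader's sentinel here: a value that cannot be confused with a legitimate null.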
>>
>> Thanks
>> Shawn
>>
>> On 1/7/20, 1:07 PM, "Matt Burgess" <[email protected]> wrote:
>>
>> I believe ValidateRecord would route the flow file to failure in that
>> case. If you know the kind of data going in, then you might just be
>> able to treat the failure relationship more like the invalid
>> relationship. However, I like the idea of doing it in ValidateRecord as
>> Mark proposed in the Jira.
>>
>>> On Tue, Jan 7, 2020 at 2:02 PM Shawn Weeks <[email protected]> wrote:
>>>
>>> I see where I think we could do this in the CSV Reader itself. Jackson
>>> actually has an option to fail on missing fields, and for Commons CSV we
>>> know how many fields we have and how many are in a record. Maybe just add
>>> a couple of true/false properties to fail on too few or too many fields,
>>> like we do with PutDatabaseRecord. Not sure how to integrate that into
>>> ValidateRecord, though, as by that point it's already been parsed.
>>>
>>> Thanks
>>>
>>> Shawn
>>>
>>> From: Mark Payne <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, January 7, 2020 at 12:52 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> I do agree that this is something that is not easily done with
>>> ValidateRecord (and probably cannot be done with ValidateRecord). But it's
>>> something that I think ValidateRecord *should* allow for. To that end,
>>> I've created a JIRA [1] to track this.
>>>
>>> Thanks
>>>
>>> -Mark
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-6986
>>>
>>>> On Jan 7, 2020, at 1:24 PM, Shawn Weeks <[email protected]> wrote:
>>>
>>> I've been playing around with it, but I'm not sure how to do the kind of
>>> validation I need. Consider this CSV. How would I validate it with
>>> ValidateCsv?
>>>
>>> Good CSV
>>>
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world,
>>> hello,,
>>>
>>> Bad CSV
>>>
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world
>>> hello
>>>
>>> From: Emanuel Oliveira <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, January 7, 2020 at 12:21 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> ValidateCsv is the most robust option (it handles missing fields as you
>>> need). It doesn't use Avro schemas; instead it uses an inline sequence of
>>> functions to accomplish whatever you want (nulls OK or not, types, regex,
>>> etc.).
>>>
>>> In a recent project, while striving for maximum data quality, I tried all
>>> the different processors and options, and ValidateCsv is the clear winner
>>> for CSVs.
>>>
>>> Emanuel O.
>>>
>>>> On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> wrote:
>>>
>>> What about ValidateCsv, could that do what you want?
>>>
>>> Sent from my iPhone
>>>
>>>> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
>>>
>>> I'm poking around to see if I can make the CSV parsers fail on a schema
>>> mismatch like that. A stream command would be a good option, though.
>>>
>>> Thanks
>>>
>>> Shawn
>>>
>>> From: Mike Thomsen <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday, January 6, 2020 at 4:35 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> We have a lot of the same issues where I work, and our solution is to use
>>> ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin
>>> line by line to check whether the export is screwed up.
>>> Some of our sources are good and we don't have to do that, but others are
>>> minefields in terms of the quality of the upstream data, and that's the
>>> only way we've found to predictably handle such things.
>>>
>>>> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
>>>
>>> That's the challenge: the values can be null, but I want to know when
>>> fields are missing (i.e., not enough delimiters). I run into a common
>>> scenario where line feeds end up in the data, making a short row.
>>> Currently the reader just ignores the fact that there aren't enough
>>> delimiters and makes the missing fields null.
>>>
>>>> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
>>>
>>> Shawn,
>>>
>>> Your schema indicates that the fields are optional because of the
>>> "type" : ["null", "string"], so IIRC they won't be marked as invalid
>>> because they are treated as null (I'm not sure there's a difference in
>>> the code between missing and null fields).
>>>
>>> You can try "type" : "string" in ValidateRecord to see if that fixes
>>> it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
>>>
>>> Regards,
>>> Matt
>>>
>>> On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
>>>>
>>>> I'm trying to validate that a CSV file has the number of fields defined
>>>> in its Avro schema. Consider the following schema and CSVs. I would like
>>>> to be able to reject the invalid CSV as missing fields.
>>>>
>>>> {
>>>>   "type" : "record",
>>>>   "namespace" : "nifi",
>>>>   "name" : "nifi",
>>>>   "fields" : [
>>>>     { "name" : "c1" , "type" : ["null", "string"] },
>>>>     { "name" : "c2" , "type" : ["null", "string"] },
>>>>     { "name" : "c3" , "type" : ["null", "string"] }
>>>>   ]
>>>> }
>>>>
>>>> Good CSV
>>>>
>>>> c1,c2,c3
>>>> hello,world,1
>>>> hello,world,
>>>> hello,,
>>>>
>>>> Bad CSV
>>>>
>>>> c1,c2,c3
>>>> hello,world,1
>>>> hello,world
>>>> hello
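[Editorial aside: the rejection Shawn asks for can be prototyped in the spirit of Mike Thomsen's ExecuteStreamCommand suggestion. The sketch below is an assumption-laden illustration, not an established NiFi recipe: it flags any data row whose field count differs from the header's, accepting the "Good CSV" and rejecting the "Bad CSV" above.]

```python
import csv
import io

def find_bad_rows(text, expected_fields):
    """Return 1-based data-row numbers whose field count != expected_fields.

    Parsing with the csv module means quoted fields containing commas or
    embedded line feeds count as one field instead of splitting a row.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    return [n for n, row in enumerate(reader, 1)
            if len(row) != expected_fields]

GOOD = "c1,c2,c3\nhello,world,1\nhello,world,\nhello,,\n"
BAD = "c1,c2,c3\nhello,world,1\nhello,world\nhello\n"

good_result = find_bad_rows(GOOD, 3)  # [] -- every row has 3 fields
bad_result = find_bad_rows(BAD, 3)    # rows 2 and 3 are short
```

In an ExecuteStreamCommand setting, the same function could be fed `sys.stdin.read()` with the expected count taken from the header (or from the Avro schema's `fields` list), exiting nonzero so the flow can route the file to failure; that wiring is assumed here, not shown.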
