Seems like all you need is for the getValue method on the Record interface to 
return something different from null when the field doesn't exist. Maybe an 
enum? Then ValidateRecord could correctly determine that a nullable field 
didn't exist vs. was just null.
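To make the idea concrete, here's a minimal stand-alone sketch of that sentinel-enum approach. This is not the actual NiFi Record API; the class, method, and enum names are all illustrative assumptions.

```java
// Hypothetical sketch (not the real NiFi Record API): a sentinel enum value
// lets callers distinguish "field absent from the record" from "field present
// but explicitly null".
import java.util.HashMap;
import java.util.Map;

public class SentinelRecord {
    // Single-value enum used as a sentinel for fields that were never populated.
    public enum FieldState { MISSING }

    private final Map<String, Object> values = new HashMap<>();

    public void setValue(String field, Object value) {
        values.put(field, value);
    }

    // Returns FieldState.MISSING when the field was never set,
    // null when it was explicitly set to null, and the value otherwise.
    public Object getValue(String field) {
        if (!values.containsKey(field)) {
            return FieldState.MISSING;
        }
        return values.get(field);
    }

    public static void main(String[] args) {
        SentinelRecord record = new SentinelRecord();
        record.setValue("c1", "hello");
        record.setValue("c2", null);   // present but null
        // "c3" is never set -> missing entirely

        System.out.println(record.getValue("c1"));                       // hello
        System.out.println(record.getValue("c2"));                       // null
        System.out.println(record.getValue("c3") == FieldState.MISSING); // true
    }
}
```

A validator could then treat FieldState.MISSING as invalid while still accepting an explicit null for a nullable field.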

On 1/7/20, 2:03 PM, "Shawn Weeks" <[email protected]> wrote:

    So I'm playing around with it some more, and ValidateRecord is kinda 
working, but it's finding stuff valid that isn't. Shouldn't some of those 
records have been rejected? I realized that writing out to Avro was stubbing 
out the missing fields, but the actual input didn't have them.
    
    This is the input
    c1,c2,c3
    hello,world,1
    hello,world
    hello
    hello,,
    
    The schema
    {
       "type" : "record",
       "namespace" : "nifi",
       "name" : "nifi",
       "fields" : [
          { "name" : "c1" , "type" : ["null","string"] },
          { "name" : "c2" , "type" : ["null","string"] },
          { "name" : "c3" , "type" : ["null","string"] }
       ]
    }
    
    The output in JSON
    [ {
      "c1" : "hello",
      "c2" : "world",
      "c3" : "1"
    }, {
      "c1" : "hello",
      "c2" : "world"
    }, {
      "c1" : "hello"
    }, {
      "c1" : "hello",
      "c2" : null,
      "c3" : null
    } ]
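For what it's worth, the missing-field check itself is cheap if done on the raw lines before parsing into records. A rough stand-alone sketch (not NiFi code, and deliberately naive: it splits on commas and does not handle quoted fields) that counts fields per line against the header:

```java
// Stand-alone sketch (not NiFi code): reject CSV lines whose field count
// differs from the header's. Uses split with a -1 limit so trailing empty
// fields ("hello,,") are still counted. Does not handle quoted commas.
import java.util.ArrayList;
import java.util.List;

public class StrictFieldCount {
    public static List<String> invalidLines(String csv) {
        String[] lines = csv.split("\n");
        int expected = lines[0].split(",", -1).length; // header defines the count
        List<String> invalid = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].split(",", -1).length != expected) {
                invalid.add(lines[i]);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String input = "c1,c2,c3\nhello,world,1\nhello,world\nhello\nhello,,";
        // "hello,world" and "hello" are short rows; "hello,," has empty but
        // present fields, so it passes.
        System.out.println(invalidLines(input)); // [hello,world, hello]
    }
}
```

On the sample input above, this flags exactly the two rows that the record reader was silently padding with nulls.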
    
    On 1/7/20, 1:14 PM, "Shawn Weeks" <[email protected]> wrote:
    
        I think to do that we'll need a way to distinguish between fields that 
were there and blank and fields that were never populated. Maybe an extension 
of the Record API itself to distinguish between null and not there. Or allow 
the CSV Reader to redefine null to something like an empty string.
        
        I welcome any ideas.
        
        Thanks
        Shawn
        
        On 1/7/20, 1:07 PM, "Matt Burgess" <[email protected]> wrote:
        
            I believe ValidateRecord would route the flow file to failure in 
that
            case. If you know the kind of data going in, then you might just be
            able to treat the failure relationship more like the invalid
            relationship. However I like the idea of doing it in ValidateRecord 
as
            Mark proposed in the Jira.
            
            On Tue, Jan 7, 2020 at 2:02 PM Shawn Weeks 
<[email protected]> wrote:
            >
            > I see where I think we could do this in the CSV Reader itself. 
Jackson actually has an option to fail on missing fields, and for commons-csv 
we know how many fields we have and how many are in a record. Maybe just add a 
couple of true/false properties to fail on too few or too many fields, like we 
do with PutDatabaseRecord. Not sure how to integrate that into ValidateRecord 
though, as by that point it's already been parsed.
            >
            >
            >
            > Thanks
            >
            > Shawn
            >
            >
            >
            > From: Mark Payne <[email protected]>
            > Reply-To: "[email protected]" <[email protected]>
            > Date: Tuesday, January 7, 2020 at 12:52 PM
            > To: "[email protected]" <[email protected]>
            > Subject: Re: Validating CSV File
            >
            >
            >
            > I do agree that this is something that is not easily done with 
ValidateRecord (and probably cannot be done with ValidateRecord). But it's 
something that I think ValidateRecord *should* allow for. To that end, I've 
created a JIRA [1] to track this.
            >
            >
            >
            > Thanks
            >
            > -Mark
            >
            >
            >
            > [1] https://issues.apache.org/jira/browse/NIFI-6986
            >
            >
            >
            >
            >
            > On Jan 7, 2020, at 1:24 PM, Shawn Weeks 
<[email protected]> wrote:
            >
            >
            >
            > I’ve been playing around with it but I’m not sure how to do the 
kind of validation I need. Consider this CSV. How would I validate this with 
ValidateCsv?
            >
            >
            >
            > Good CSV
            >
            > c1,c2,c3
            >
            > hello,world,1
            >
            > hello,world,
            >
            > hello,,
            >
            >
            >
            > Bad CSV
            >
            > c1,c2,c3
            >
            > hello,world,1
            >
            > hello,world
            >
            > hello
            >
            >
            >
            > From: Emanuel Oliveira <[email protected]>
            > Reply-To: "[email protected]" <[email protected]>
            > Date: Tuesday, January 7, 2020 at 12:21 PM
            > To: "[email protected]" <[email protected]>
            > Subject: Re: Validating CSV File
            >
            >
            >
            > ValidateCsv is the most robust (it handles missing fields as you 
need); it doesn't use Avro schemas, and instead uses an inline sequence of 
functions to accomplish anything you want (nulls OK or not, types, regex, etc.).
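Based on my reading of the ValidateCsv processor docs, its Schema property takes a comma-separated list of SuperCSV cell processors, one per column. A sketch for the three-column example in this thread might be:

```text
Null, Null, Null
```

As I understand it, each `Null` allows an empty cell, but a short row like `hello,world` still fails validation because its column count doesn't match the number of processors, which is exactly the missing-delimiter case being discussed.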
            >
            >
            >
            > In a recent project, while striving for maximum data quality, I 
tried all the different processors and options, and ValidateCsv is the clear 
winner for CSVs.
            >
            >
            >
            > Emanuel O.
            >
            >
            >
            > On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> 
wrote:
            >
            > What about ValidateCsv, could that do what you want?
            >
            > Sent from my iPhone
            >
            >
            >
            >
            > On Jan 6, 2020, at 6:10 PM, Shawn Weeks 
<[email protected]> wrote:
            >
            > I’m poking around to see if I can make the csv parsers fail on a 
schema mismatch like that. A stream command would be a good option though.
            >
            >
            >
            > Thanks
            >
            > Shawn
            >
            >
            >
            > From: Mike Thomsen <[email protected]>
            > Reply-To: "[email protected]" <[email protected]>
            > Date: Monday, January 6, 2020 at 4:35 PM
            > To: "[email protected]" <[email protected]>
            > Subject: Re: Validating CSV File
            >
            >
            >
            > We have a lot of the same issues where I work, and our solution 
is to use ExecuteStreamCommand to pass CSVs off to Python scripts that will 
read stdin line by line to check to see if the export isn't screwed up. Some of 
our sources are good and we don't have to do that, but others are minefields in 
terms of the quality of the upstream data source, and that's the only way we've 
found where we can predictably handle such things.
            >
            >
            >
            > On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks 
<[email protected]> wrote:
            >
            > That's the challenge: the values can be null, but I want to know 
when fields are missing (i.e., not enough delimiters). I run into a common 
scenario where line feeds end up in the data, making a short row. Currently 
the reader just ignores the fact that there aren't enough delimiters and makes 
the missing fields null.
            >
            > On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
            >
            >     Shawn,
            >
            >     Your schema indicates that the fields are optional because of 
the
            >     "type" :  ["null", "string"] , so IIRC they won't be marked 
as invalid
            >     because they are treated as null (I'm not sure there's a 
difference in
            >     the code between missing and null fields).
            >
            >     You can try "type": "string" in ValidateRecord to see if that 
fixes
            >     it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
            >
            >     Regards,
            >     Matt
            >
            >     On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks 
<[email protected]> wrote:
            >     >
            >     > I’m trying to validate that a csv file has the number of 
fields defined in its Avro schema. Consider the following schema and CSVs. I 
would like to be able to reject the invalid csv as missing fields.
            >     >
            >     >
            >     >
            >     > {
            >     >
            >     >    "type" : "record",
            >     >
            >     >    "namespace" : "nifi",
            >     >
            >     >    "name" : "nifi",
            >     >
            >     >    "fields" : [
            >     >
            >     >       { "name" : "c1" , "type" :  ["null", "string"] },
            >     >
            >     >       { "name" : "c2" , "type" : ["null", "string"] },
            >     >
            >     >       { "name" : "c3" , "type" : ["null", "string"] }
            >     >
            >     >    ]
            >     >
            >     > }
            >     >
            >     >
            >     >
            >     > Good CSV
            >     >
            >     > c1,c2,c3
            >     >
            >     > hello,world,1
            >     >
            >     > hello,world,
            >     >
            >     > hello,,
            >     >
            >     >
            >     >
            >     > Bad CSV
            >     >
            >     > c1,c2,c3
            >     >
            >     > hello,world,1
            >     >
            >     > hello,world
            >     >
            >     > hello
            >     >
            >     >
            >
            >
            >
            
        
        
    
    
