Shawn,

Please see the Jira that I referenced below. It explains how to determine 
whether or not a field exists in the record.

Thanks
-Mark

> On Jan 7, 2020, at 3:38 PM, Shawn Weeks <[email protected]> wrote:
> 
> Seems like all you need is for the getValue method on the Record interface to
> return something other than null when the field doesn't exist. Maybe an enum?
> Then ValidateRecord could correctly determine whether a nullable field was
> missing or just null.
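> 
> Something like this is what I have in mind (just a sketch of the idea, not
> the actual Record API; the sentinel type and its name are made up):
> 
>    import org.apache.nifi.serialization.record.Record;
> 
>    public class MissingVsNull {
> 
>        // Hypothetical sentinel that getValue() would return instead of null
>        // when the field was never present in the input row at all.
>        public enum FieldState { MISSING }
> 
>        public static boolean isMissing(final Record record, final String field) {
>            return record.getValue(field) == FieldState.MISSING;  // absent column
>        }
> 
>        public static boolean isPresentButNull(final Record record, final String field) {
>            return record.getValue(field) == null;                // column present, value null
>        }
>    }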
> 
> On 1/7/20, 2:03 PM, "Shawn Weeks" <[email protected]> wrote:
> 
>    So I'm playing around with it some more, and ValidateRecord is kind of
>    working, but it's finding records valid that aren't. Shouldn't some of
>    these records have been rejected? I realized that by writing out to Avro
>    it was stubbing out the missing fields, but the actual input didn't have
>    them.
> 
>    This is the input
>    c1,c2,c3
>    hello,world,1
>    hello,world
>    hello
>    hello,,
> 
>    The schema
>    {
>       "type" : "record",
>       "namespace" : "nifi",
>       "name" : "nifi",
>       "fields" : [
>          { "name" : "c1" , "type" : ["null","string"] },
>          { "name" : "c2" , "type" : ["null","string"] },
>          { "name" : "c3" , "type" : ["null","string"] }
>       ]
>    }
> 
>    The output in JSON
>    [ {
>      "c1" : "hello",
>      "c2" : "world",
>      "c3" : "1"
>    }, {
>      "c1" : "hello",
>      "c2" : "world"
>    }, {
>      "c1" : "hello"
>    }, {
>      "c1" : "hello",
>      "c2" : null,
>      "c3" : null
>    } ]
> 
>    On 1/7/20, 1:14 PM, "Shawn Weeks" <[email protected]> wrote:
> 
>        I think to do that we'll need a way to distinguish between fields that
>        were present but blank and fields that were never populated. Maybe an
>        extension of the Record API itself to distinguish between null and
>        not-there, or allowing the CSV Reader to redefine null as something
>        like an empty string.
> 
>        I welcome any ideas.
> 
>        Thanks
>        Shawn
> 
>        On 1/7/20, 1:07 PM, "Matt Burgess" <[email protected]> wrote:
> 
>            I believe ValidateRecord would route the flow file to failure in
>            that case. If you know the kind of data going in, then you might
>            just be able to treat the failure relationship more like the
>            invalid relationship. However, I like the idea of doing it in
>            ValidateRecord as Mark proposed in the Jira.
> 
>            On Tue, Jan 7, 2020 at 2:02 PM Shawn Weeks 
> <[email protected]> wrote:
>> 
>> I see where I think we could do this in the CSV Reader itself. Jackson
>> actually has an option to fail on missing fields, and for Commons CSV we
>> know how many fields the schema has and how many are in a record. Maybe just
>> add a couple of true/false properties to fail on too few or too many fields,
>> like we do with PutDatabaseRecord. Not sure how to integrate that into
>> ValidateRecord, though, since by that point it's already parsed.
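>> 
>> For the Commons CSV side, the check I have in mind is basically this (rough,
>> untested sketch; the reader would do the equivalent internally and route the
>> record or flow file to failure instead of throwing):
>> 
>>     import java.io.Reader;
>>     import java.io.StringReader;
>>     import org.apache.commons.csv.CSVFormat;
>>     import org.apache.commons.csv.CSVParser;
>>     import org.apache.commons.csv.CSVRecord;
>> 
>>     public class StrictColumnCount {
>>         public static void main(String[] args) throws Exception {
>>             Reader in = new StringReader("c1,c2,c3\nhello,world,1\nhello,world\nhello\nhello,,\n");
>>             try (CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(in)) {
>>                 for (CSVRecord rec : parser) {
>>                     // isConsistent() is false when the row has more or fewer
>>                     // values than the header row.
>>                     if (!rec.isConsistent()) {
>>                         throw new IllegalStateException("Record " + rec.getRecordNumber()
>>                                 + " has " + rec.size() + " fields, expected "
>>                                 + parser.getHeaderMap().size());
>>                     }
>>                 }
>>             }
>>         }
>>     }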
>> 
>> 
>> 
>> Thanks
>> 
>> Shawn
>> 
>> 
>> 
>> From: Mark Payne <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 7, 2020 at 12:52 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>> 
>> 
>> 
>> I do agree that this is something that is not easily done with 
>> ValidateRecord (and probably cannot be done with ValidateRecord). But it's 
>> something that I think ValidateRecord *should* allow for. To that end, I've 
>> created a JIRA [1] to track this.
>> 
>> 
>> 
>> Thanks
>> 
>> -Mark
>> 
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/NIFI-6986
>> 
>> 
>> 
>> 
>> 
>> On Jan 7, 2020, at 1:24 PM, Shawn Weeks <[email protected]> wrote:
>> 
>> 
>> 
>> I've been playing around with it, but I'm not sure how to do the kind of
>> validation I need. Consider these CSVs. How would I validate them with
>> ValidateCsv?
>> 
>> 
>> 
>> Good CSV
>> 
>> c1,c2,c3
>> 
>> hello,world,1
>> 
>> hello,world,
>> 
>> hello,,
>> 
>> 
>> 
>> Bad CSV
>> 
>> c1,c2,c3
>> 
>> hello,world,1
>> 
>> hello,world
>> 
>> hello
>> 
>> 
>> 
>> From: Emanuel Oliveira <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, January 7, 2020 at 12:21 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>> 
>> 
>> 
>> ValidateCsv is the most robust option (it handles missing fields, which is
>> what you need). It doesn't use Avro schemas; instead it uses an inline
>> sequence of functions (cell processors) to accomplish anything you want
>> (nulls OK or not, types, regex, etc.).
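>> 
>> For your c1,c2,c3 example the Schema property would look roughly like this,
>> allowing every value to be null but (just as an example) parsing c3 as an
>> int when present. Syntax is from memory, so double-check it against the
>> ValidateCsv docs:
>> 
>>     Null, Null, Optional(ParseInt())
>> 
>> There is one cell processor per column, so a row that doesn't have all three
>> columns should be flagged as invalid even though the individual values are
>> allowed to be null.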
>> 
>> 
>> 
>> In a recent project, while struggling for maximum data quality, I tried all
>> the different processors and options, and ValidateCsv was the clear winner
>> for CSVs.
>> 
>> 
>> 
>> Emanuel O.
>> 
>> 
>> 
>> On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> wrote:
>> 
>> What about ValidateCsv, could that do what you want?
>> 
>> Sent from my iPhone
>> 
>> 
>> 
>> 
>> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
>> 
>> I'm poking around to see if I can make the CSV parsers fail on a schema
>> mismatch like that. A stream command would be a good option, though.
>> 
>> 
>> 
>> Thanks
>> 
>> Shawn
>> 
>> 
>> 
>> From: Mike Thomsen <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, January 6, 2020 at 4:35 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Validating CSV File
>> 
>> 
>> 
>> We have a lot of the same issues where I work, and our solution is to use
>> ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin line
>> by line to check whether the export is screwed up. Some of our sources are
>> good and we don't have to do that, but others are minefields in terms of the
>> quality of the upstream data source, and that's the only way we've found to
>> predictably handle such things.
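>> 
>> The check itself can be as dumb as counting delimiters per line against the
>> header. Something along these lines (sketched in Java here purely to show
>> the idea; ours are plain Python, and this version ignores quoted commas):
>> 
>>     import java.io.BufferedReader;
>>     import java.io.InputStreamReader;
>> 
>>     // Reads a CSV from stdin (as ExecuteStreamCommand would feed it) and
>>     // exits non-zero if any row has a different field count than the header.
>>     public class CsvShapeCheck {
>>         public static void main(String[] args) throws Exception {
>>             BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
>>             String header = in.readLine();
>>             if (header == null) {
>>                 return;  // empty input, nothing to check
>>             }
>>             int expected = header.split(",", -1).length;
>>             String line;
>>             long lineNo = 1;
>>             while ((line = in.readLine()) != null) {
>>                 lineNo++;
>>                 if (line.split(",", -1).length != expected) {
>>                     System.err.println("Bad row at line " + lineNo + ": " + line);
>>                     System.exit(1);
>>                 }
>>             }
>>         }
>>     }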
>> 
>> 
>> 
>> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
>> 
>> That's the challenge: the values can be null, but I want to know when fields
>> are missing (i.e., not enough delimiters). I run into a common scenario where
>> line feeds end up in the data, making a short row. Currently the reader just
>> ignores the fact that there aren't enough delimiters and makes the missing
>> fields null.
>> 
>> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
>> 
>>    Shawn,
>> 
>>    Your schema indicates that the fields are optional because of the
>>    "type" : ["null", "string"], so IIRC they won't be marked as invalid
>>    because they are treated as null (I'm not sure there's a difference in
>>    the code between missing and null fields).
>> 
>>    You can try "type": "string" in ValidateRecord to see if that fixes
>>    it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
>> 
>>    Regards,
>>    Matt
>> 
>>    On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> 
>> wrote:
>>> 
>>> I'm trying to validate that a CSV file has the number of fields defined in
>>> its Avro schema. Consider the following schema and CSVs. I would like to be
>>> able to reject the invalid CSV for missing fields.
>>> 
>>> 
>>> 
>>> {
>>> 
>>>   "type" : "record",
>>> 
>>>   "namespace" : "nifi",
>>> 
>>>   "name" : "nifi",
>>> 
>>>   "fields" : [
>>> 
>>>      { "name" : "c1" , "type" :  ["null", "string"] },
>>> 
>>>      { "name" : "c2" , "type" : ["null", "string"] },
>>> 
>>>      { "name" : "c3" , "type" : ["null", "string"] }
>>> 
>>>   ]
>>> 
>>> }
>>> 
>>> 
>>> 
>>> Good CSV
>>> 
>>> c1,c2,c3
>>> 
>>> hello,world,1
>>> 
>>> hello,world,
>>> 
>>> hello,,
>>> 
>>> 
>>> 
>>> Bad CSV
>>> 
>>> c1,c2,c3
>>> 
>>> hello,world,1
>>> 
>>> hello,world
>>> 
>>> hello
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 
> 
> 
> 
