Just saw it. It looks pretty simple and you gave plenty of details on how to implement it. Thanks
Sent from my iPhone

> On Jan 7, 2020, at 2:56 PM, Mark Payne <[email protected]> wrote:
>
> Shawn,
>
> Please see the Jira that I referenced below. It explains how to determine
> whether or not a field exists in the record.
>
> Thanks
> -Mark
>
>> On Jan 7, 2020, at 3:38 PM, Shawn Weeks <[email protected]> wrote:
>>
>> Seems like all you need is for the getValue method on the Record interface
>> to return something different if the field doesn't exist, instead of null.
>> Maybe an enum? Then in ValidateRecord you could correctly determine that a
>> nullable field didn't exist vs. was just null.
>>
>> On 1/7/20, 2:03 PM, "Shawn Weeks" <[email protected]> wrote:
>>
>> So I'm playing around with it some more, and ValidateRecord is kind of
>> working, but it's marking records valid that aren't. Shouldn't some of
>> those records have been rejected? I realized that by writing out to Avro
>> it was stubbing out the missing fields, but the actual input didn't have
>> them.
>>
>> This is the input:
>>
>> c1,c2,c3
>> hello,world,1
>> hello,world
>> hello
>> hello,,
>>
>> The schema:
>>
>> {
>>   "type" : "record",
>>   "namespace" : "nifi",
>>   "name" : "nifi",
>>   "fields" : [
>>     { "name" : "c1" , "type" : ["null","string"] },
>>     { "name" : "c2" , "type" : ["null","string"] },
>>     { "name" : "c3" , "type" : ["null","string"] }
>>   ]
>> }
>>
>> The output in JSON:
>>
>> [ {
>>   "c1" : "hello",
>>   "c2" : "world",
>>   "c3" : "1"
>> }, {
>>   "c1" : "hello",
>>   "c2" : "world"
>> }, {
>>   "c1" : "hello"
>> }, {
>>   "c1" : "hello",
>>   "c2" : null,
>>   "c3" : null
>> } ]
>>
>> On 1/7/20, 1:14 PM, "Shawn Weeks" <[email protected]> wrote:
>>
>> I think to do that we'll need a way to distinguish between fields that
>> were present but blank and fields that were never populated. Maybe an
>> extension of the Record API itself to distinguish between null and not
>> there, or allowing the CSV reader to redefine null as something like the
>> empty string.
>>
>> I welcome any ideas.
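[Editorial aside: the missing-vs-null distinction discussed above can be illustrated outside NiFi with a short sketch using Python's standard csv module. This is only an analogy, not NiFi's actual CSV reader: the raw parse keeps each row's physical field count, so a field that was never present is distinguishable from one that is present but empty.]

```python
import csv
import io

data = "c1,c2,c3\nhello,world,1\nhello,world\nhello\nhello,,\n"

# csv.reader keeps the physical field count per row, so a short row
# (missing trailing fields) is detectable by length alone.
rows = list(csv.reader(io.StringIO(data)))
header, body = rows[0], rows[1:]
lengths = [len(r) for r in body]          # [3, 2, 1, 3]

# csv.DictReader fills fields that were never present with a sentinel
# (None by default), which stays distinct from the empty string that
# "hello,," produces -- the same null-vs-missing split discussed above.
dict_rows = list(csv.DictReader(io.StringIO(data)))
missing_c3 = dict_rows[1]["c3"] is None   # "hello,world": c3 was absent
empty_c3 = dict_rows[3]["c3"] == ""       # "hello,,": c3 present, blank
```

A Record API enum as Shawn suggests would play the same role as DictReader's sentinel here: a value that cannot be confused with a legitimate null.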
>>
>> Thanks
>> Shawn
>>
>> On 1/7/20, 1:07 PM, "Matt Burgess" <[email protected]> wrote:
>>
>> I believe ValidateRecord would route the flow file to failure in that
>> case. If you know the kind of data going in, then you might just be
>> able to treat the failure relationship more like the invalid
>> relationship. However, I like the idea of doing it in ValidateRecord as
>> Mark proposed in the Jira.
>>
>>> On Tue, Jan 7, 2020 at 2:02 PM Shawn Weeks <[email protected]> wrote:
>>>
>>> I see where I think we could do this in the CSV Reader itself. Jackson
>>> actually has an option to fail on missing fields, and for Commons CSV we
>>> know how many fields we have and how many are in a record. Maybe just add
>>> a couple of true/false properties to fail on too few or too many fields,
>>> like we do with PutDatabaseRecord. Not sure how to integrate that into
>>> ValidateRecord, though, as by that point it's already been parsed.
>>>
>>> Thanks
>>>
>>> Shawn
>>>
>>> From: Mark Payne <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, January 7, 2020 at 12:52 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> I do agree that this is something that is not easily done with
>>> ValidateRecord (and probably cannot be done with ValidateRecord). But it's
>>> something that I think ValidateRecord *should* allow for. To that end,
>>> I've created a JIRA [1] to track this.
>>>
>>> Thanks
>>>
>>> -Mark
>>>
>>> [1] https://issues.apache.org/jira/browse/NIFI-6986
>>>
>>>> On Jan 7, 2020, at 1:24 PM, Shawn Weeks <[email protected]> wrote:
>>>
>>> I've been playing around with it, but I'm not sure how to do the kind of
>>> validation I need. Consider this CSV. How would I validate it with
>>> ValidateCsv?
>>>
>>> Good CSV
>>>
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world,
>>> hello,,
>>>
>>> Bad CSV
>>>
>>> c1,c2,c3
>>> hello,world,1
>>> hello,world
>>> hello
>>>
>>> From: Emanuel Oliveira <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, January 7, 2020 at 12:21 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> ValidateCsv is the most robust option (it handles missing fields as you
>>> need). It doesn't use Avro schemas; instead it uses an inline sequence of
>>> functions to accomplish whatever you want (nulls OK or not, types, regex,
>>> etc.).
>>>
>>> In a recent project, while striving for maximum data quality, I tried all
>>> the different processors and options, and ValidateCsv is the clear winner
>>> for CSVs.
>>>
>>> Emanuel O.
>>>
>>>> On Mon 6 Jan 2020, 23:36 Matt Burgess, <[email protected]> wrote:
>>>
>>> What about ValidateCsv, could that do what you want?
>>>
>>> Sent from my iPhone
>>>
>>>> On Jan 6, 2020, at 6:10 PM, Shawn Weeks <[email protected]> wrote:
>>>
>>> I'm poking around to see if I can make the CSV parsers fail on a schema
>>> mismatch like that. A stream command would be a good option, though.
>>>
>>> Thanks
>>>
>>> Shawn
>>>
>>> From: Mike Thomsen <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday, January 6, 2020 at 4:35 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Validating CSV File
>>>
>>> We have a lot of the same issues where I work, and our solution is to use
>>> ExecuteStreamCommand to pass CSVs off to Python scripts that read stdin
>>> line by line to check whether the export is screwed up.
>>> Some of our sources are good and we don't have to do that, but others are
>>> minefields in terms of the quality of the upstream data, and that's the
>>> only way we've found to predictably handle such things.
>>>
>>>> On Mon, Jan 6, 2020 at 4:57 PM Shawn Weeks <[email protected]> wrote:
>>>
>>> That's the challenge: the values can be null, but I want to know when
>>> fields are missing (i.e., not enough delimiters). I run into a common
>>> scenario where line feeds end up in the data, making a short row.
>>> Currently the reader just ignores the fact that there aren't enough
>>> delimiters and makes the missing fields null.
>>>
>>>> On 1/6/20, 3:50 PM, "Matt Burgess" <[email protected]> wrote:
>>>
>>> Shawn,
>>>
>>> Your schema indicates that the fields are optional because of the
>>> "type" : ["null", "string"], so IIRC they won't be marked as invalid
>>> because they are treated as null (I'm not sure there's a difference in
>>> the code between missing and null fields).
>>>
>>> You can try "type" : "string" in ValidateRecord to see if that fixes
>>> it, or there's a "StrNotNullOrEmpty" operator in ValidateCsv.
>>>
>>> Regards,
>>> Matt
>>>
>>> On Mon, Jan 6, 2020 at 4:35 PM Shawn Weeks <[email protected]> wrote:
>>>>
>>>> I'm trying to validate that a CSV file has the number of fields defined
>>>> in its Avro schema. Consider the following schema and CSVs. I would like
>>>> to be able to reject the invalid CSV as missing fields.
>>>>
>>>> {
>>>>   "type" : "record",
>>>>   "namespace" : "nifi",
>>>>   "name" : "nifi",
>>>>   "fields" : [
>>>>     { "name" : "c1" , "type" : ["null", "string"] },
>>>>     { "name" : "c2" , "type" : ["null", "string"] },
>>>>     { "name" : "c3" , "type" : ["null", "string"] }
>>>>   ]
>>>> }
>>>>
>>>> Good CSV
>>>>
>>>> c1,c2,c3
>>>> hello,world,1
>>>> hello,world,
>>>> hello,,
>>>>
>>>> Bad CSV
>>>>
>>>> c1,c2,c3
>>>> hello,world,1
>>>> hello,world
>>>> hello
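[Editorial aside: the rejection Shawn asks for can be prototyped in the spirit of Mike Thomsen's ExecuteStreamCommand suggestion. The sketch below is an assumption-laden illustration, not an established NiFi recipe: it flags any data row whose field count differs from the header's, accepting the "Good CSV" and rejecting the "Bad CSV" above.]

```python
import csv
import io

def find_bad_rows(text, expected_fields):
    """Return 1-based data-row numbers whose field count != expected_fields.

    Parsing with the csv module means quoted fields containing commas or
    embedded line feeds count as one field instead of splitting a row.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    return [n for n, row in enumerate(reader, 1)
            if len(row) != expected_fields]

GOOD = "c1,c2,c3\nhello,world,1\nhello,world,\nhello,,\n"
BAD = "c1,c2,c3\nhello,world,1\nhello,world\nhello\n"

good_result = find_bad_rows(GOOD, 3)  # [] -- every row has 3 fields
bad_result = find_bad_rows(BAD, 3)    # rows 2 and 3 are short
```

In an ExecuteStreamCommand setting, the same function could be fed `sys.stdin.read()` with the expected count taken from the header (or from the Avro schema's `fields` list), exiting nonzero so the flow can route the file to failure; that wiring is assumed here, not shown.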
