[jira] [Commented] (NIFI-9462) JsonTreeReader schema inference only examines first record for parts of structure, causing data loss for subsequent records

2021-12-29 Thread Otto Fowler (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466546#comment-17466546
 ] 

Otto Fowler commented on NIFI-9462:
---

Sorry [~Josiah.Johnston], I was greatly mistaken.

> JsonTreeReader schema inference only examines first record for parts of 
> structure, causing data loss for subsequent records
> ---
>
> Key: NIFI-9462
> URL: https://issues.apache.org/jira/browse/NIFI-9462
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.13.0
> Reporter: Josiah Johnston
> Priority: Major
> Attachments: JSON record set writer.png, JSON tree reader.png, 
> NIFI-9462_example.xml, flow.png, updateRecord.png
>
>
> I use GenerateFlowFile with this JSON line content, and send it through an 
> UpdateRecord processor that uses a JsonTreeReader with schema inference. The 
> UpdateRecord processor adds a top level `s3_key` element with the filename.
> {"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
> {"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
> {"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
> In the output, the structure of `_source.metadata.*` is always strictly based 
> on the first record, causing data loss for subsequent records that have 
> different fields.
> {"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":"temperature-changed","metadata":{"voltage":null},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> {"_source":{"name":null,"metadata":{"voltage":6.3},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"9830423c-c8b6-4a03-a4a1-427750e94d26"}
> In general, it drops all 3rd-level keys that weren't seen in the first record 
> (_source.metadata.temperature in record 2, _source.metadata.other_L3_keys in 
> record 3). This behavior only applies to keys in the 3rd level; schema 
> inference works as documented (scanning through all records) for alternative 
> keys in the 1st & 2nd levels.
> This behavior persists whether I specify the input as JSON lines (shown in 
> this example), or if I rearrange it to be a JSON array.
> I've attached screenshots of a minimal example and settings of JSON reader & 
> writer. I've also attached a template of a minimal example. If you import it, 
> you'll need to create & enable controller services of JsonTreeReader & 
> JsonRecordSetWriter (default values for each).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (NIFI-9462) JsonTreeReader schema inference only examines first record for parts of structure, causing data loss for subsequent records

2021-12-29 Thread Mark Payne (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466517#comment-17466517
 ] 

Mark Payne commented on NIFI-9462:
--

[~Josiah.Johnston], can you try running this flow in a newer version of NiFi? I 
just tried running it in 1.15.1, and I got the output that I was expecting, 
namely:
{noformat}
{"_source":{"name":"battery-voltage-changed","metadata":{"voltage":2.8},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"d8670196-f302-4860-8753-213cdcdf9c24"}
{"_source":{"name":"temperature-changed","metadata":{"temperature":19.54},"other_L2_keys":null},"other_L1_keys":null,"s3_key":"d8670196-f302-4860-8753-213cdcdf9c24"}
{"_source":{"name":null,"metadata":{"voltage":6.3,"other_L3_keys":"are_lost"},"other_L2_keys":"are_preserved"},"other_L1_keys":"are_preserved","s3_key":"d8670196-f302-4860-8753-213cdcdf9c24"}
{noformat}
Is this the output that you expected? If so, please try with the latest 
release, 1.15.2.

If this is NOT the expected output, then please help me to understand what I 
overlooked.

Thanks



[jira] [Commented] (NIFI-9462) JsonTreeReader schema inference only examines first record for parts of structure, causing data loss for subsequent records

2021-12-29 Thread Mark Payne (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466516#comment-17466516
 ] 

Mark Payne commented on NIFI-9462:
--

[~otto], schema inference should not use only the first record. Rather, it 
should read ALL records and create an "uber schema" that encapsulates 
all fields. In this way, all records will be homogeneous in that they all 
implement the same schema. But it's very common to have some records with 
missing fields, especially with JSON, where null fields are often left unwritten 
in order to improve efficiency, etc.
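
The "uber schema" idea described above can be illustrated with a minimal Python sketch. This is not NiFi's implementation; the helper names (infer_schema, merge_schemas, infer_uber_schema) are hypothetical, and the sketch assumes JSON-lines input and ignores conflicting scalar types. It only shows that taking the union of fields across all records, recursing into nested objects, keeps the 3rd-level keys from the example in this issue.

```python
import json


def infer_schema(value):
    """Return a nested dict of field names for JSON objects, or a type name for other values."""
    if isinstance(value, dict):
        return {key: infer_schema(child) for key, child in value.items()}
    return type(value).__name__


def merge_schemas(left, right):
    """Union two inferred schemas, recursing into nested records."""
    if isinstance(left, dict) and isinstance(right, dict):
        merged = dict(left)
        for key, child in right.items():
            merged[key] = merge_schemas(merged[key], child) if key in merged else child
        return merged
    # Conflicting types are out of scope for this sketch: keep the first one seen.
    return left


def infer_uber_schema(json_lines):
    """Read every record (not just the first) and merge the per-record schemas."""
    schema = {}
    for line in json_lines.splitlines():
        if line.strip():
            schema = merge_schemas(schema, infer_schema(json.loads(line)))
    return schema


# The three input records from the issue description.
records = """\
{"_source": {"name": "battery-voltage-changed", "metadata": {"voltage": 2.8}}}
{"_source": {"name": "temperature-changed", "metadata": {"temperature": 19.54}}}
{"other_L1_keys": "are_preserved", "_source": {"other_L2_keys": "are_preserved", "metadata": {"voltage": 6.3, "other_L3_keys": "are_lost"}}}
"""

uber = infer_uber_schema(records)
print(sorted(uber["_source"]["metadata"]))
# prints ['other_L3_keys', 'temperature', 'voltage']
```

With a merge of this kind, fields seen only in later records (temperature, other_L3_keys) are part of the schema, so they are not dropped when records are written back out.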



[jira] [Commented] (NIFI-9462) JsonTreeReader schema inference only examines first record for parts of structure, causing data loss for subsequent records

2021-12-29 Thread Otto Fowler (Jira)


[ 
https://issues.apache.org/jira/browse/NIFI-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17466504#comment-17466504
 ] 

Otto Fowler commented on NIFI-9462:
---

I believe this is working correctly. When you pass multiple JSON 
structures to schema inference, it is going to assume that they are 
homogeneous and use the first one.
