[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-23 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
This is now merged to master. Thanks again for the contribution!


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-23 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@JohannesDaniel that would be great! I think the Record attribute stuff 
will significantly improve how we are able to handle XML-based records. But I 
think the approach that you've taken here is a great first step and will be 
very beneficial to the community. I'd recommend going ahead with the XML Writer 
and then we can work in the attribute-related stuff later.


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-23 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@markap14 thanks for the help with the patch file. I am fine with it :)

If you have not planned to do that yourself, I would start implementing a 
XMLWriter as soon as this has been merged. Or should we first discuss your plan 
with the Record attributes you mentioned above?


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-23 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@JohannesDaniel this is great! I've been doing a good bit of testing to 
ensure that everything works as expected. I had just a few more comments, 
mostly around the descriptions in the property descriptor and the handling of 
invalid values for the "Expect Records as Array" property. To make it a little 
easier to understand, I'm just going to attach a PATCH file to the JIRA. Can 
you please look over the patch file? If you're okay with the patch file, then 
I'm ready to merge. Thanks again, you've put in a lot of work here to make this 
consistent with the current approaches and to provide a huge new capability for 
many users!


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-22 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@markap14 
- Added EL for record format property
- Removed record tag validation


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-20 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@markap14 thank you for the response. 

I will simply remove that record tag validation as there are indeed many 
ways to do that before the data is processed by this reader. 

There is one little corner case, I need to discuss: 
Assuming we have the following data:
```
value1...
```
If the reader is used with (coerce==true), the field "map_field" can be 
parsed by defining a map in the schema. The embedded key fields do not have to 
be defined, its values only have to be of the defined type for the map. 

If the reader is used with (coerce==false && dropUnknown==true), the reader 
will parse all fields that exist in the schema ignoring its type. However, the 
data above will not be parsable even if the map exists in the schema. In this 
case, the reader identifies "map_field" as a field that exists in the schema, 
but the reader is not aware that it is of type map. Therefore, the reader will 
not parse the embedded key fields, as they don't exist in the schema. The field 
"map_field" will be classified as an empty field and not added to the record.

Furthermore, even if the reader is used with (coerce==false && 
dropUnknown==true), it will be type-aware to some extent. The reader first 
checks, whether fields exist in the schema. If that is the case, the reader 
additionally will check whether they are of type record (or of type array 
embedding records, respectively). If that is also the case, the reader will 
retrieve the subschema in order to be enabled to check whether subtags of the 
current tag are known. 



---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-20 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@JohannesDaniel thanks for the update! I commented above re: the use of 
Expression Language in the property descriptor.
I do still feel like the check for 'record tag names' is unnecessary, as 
the reader should not be responsible for filtering the data but rather just for 
reading it. There already exist mechanisms for filtering the data (You could 
use PartitionRecord + RouteOnAttribute, ValidateRecord, or QueryRecord just off 
the top of my head to achieve this). Additionally, we have the Schema for the 
Record Reader. So if the element name matches the top-level Schema name (or one 
of them, if the top-level field is a UNION/CHOICE element), then we could use 
that. So, with your example above, if you only want to read the 
`` part, your schema should indicate that the top-level field 
name is `record`. In that case, it should filter out the `other` record. Does 
that make sense? 


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-19 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@markap14 @pvillard31 
- I refactored some code as the cases (coerce==true && drop==false) and 
(coerce==false && drop==true) in some cases showed an unexpected behavior
- Data like contentcontentcontent now can be 
parsed
- Maps (e. g. 
value1value2) are now supported
- The reader is now able to parse single records (e. g. 
) as well as arrays of records (e. g. 
). I added a property to make it configurable 
whether the reader shall expect a single record or an array. One question: As 
there are only two options for this, I defined AllowableValues for this 
property. Despite that, I think it would be reasonable to enable EL for this 
property. But how can this be realized?
- I removed the root validation, but remained the check for record tag 
names in order to support processing data like this 
 (tag "other" will be ignored if check 
for record tag name is activated)


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-12 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@markap14 thank you for the comprehensive review. I will start refactoring 
the implementations with respect to the improvements that are clear.


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-04 Thread pvillard31
Github user pvillard31 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
Hey @JohannesDaniel - just made new tests and can't reproduce what I saw 
previously... so let's forget about it :) (thanks for the unit tests though!)

I tested the EL support for tag validation, working well, thanks for the 
addition. I'd just make one suggestion: right now for the record tag, "in the 
case of a mismatch, the respective record will be skipped". I'd suggest to make 
this behavior configurable through a parameter in the CS: skip the record 
(current behavior) or throw an exception for the flow file (as you're doing for 
the root tag validation).

Regarding the performance test, it was a GFF => ConvertRecord (XML to JSON) 
=> UpdateAttribute. Don't remember the exact numbers but IIRC it was around 
1300 flowfile per second with an XML payload of 6KB (single record in my case).

Overall, it's really cool!
I'll try to have a look over the code during the week.


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-01 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
Can you maybe post the XML that led to the empty record?


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-04-01 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
Hi @pvillard31 

thank you for your comments! I realized all your suggestions. I like your 
news regarding the performance :-) Which kind of transformation did you test? 
XML => Record or XML => JSON (e. g. with ConvertRecord)?

For any reason some tests disappeared for a certain commit at my local git 
(probably, I wanted to reorder the tests,  but deleted them, omg ...). However, 
I inserted them again (this is why there are many more tests now). 

In addition, I adjusted the definition about how namespaces shall be 
treated. 

I implemented several tests for XMLReader to verify that the usage of 
expression language works as expected.

However, I was not able to reproduce your observation regarding the empty 
record for the header 
```

```

I implemented the following tests:
```
testSimpleRecordWithHeader()
testSimpleRecordWithHeaderNoValidation()
```

Actually, they work as expected. 


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-03-29 Thread pvillard31
Github user pvillard31 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
Thanks for pinging me @JohannesDaniel - I'll also try to have a look over 
the WE / next week. It'd be a great addition!


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-03-28 Thread markap14
Github user markap14 commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@JohannesDaniel thanks for the contribution! This is definitely something 
that I've been wanting to implement for a while but haven't had a chance yet. 
Will be happy to review.


---


[GitHub] nifi issue #2587: NIFI-4185 Add XML Record Reader

2018-03-27 Thread JohannesDaniel
Github user JohannesDaniel commented on the issue:

https://github.com/apache/nifi/pull/2587
  
@pvillard31 here we go :)


---