Brandon DeVries created NIFI-4950:
-------------------------------------

             Summary: MergeContent: Defragment can improperly reassemble
                 Key: NIFI-4950
                 URL: https://issues.apache.org/jira/browse/NIFI-4950
             Project: Apache NiFi
          Issue Type: Bug
          Components: Extensions
    Affects Versions: 1.5.0
            Reporter: Brandon DeVries


In Defragment mode, MergeContent can improperly reassemble the pieces of a 
split file.  I understand this was previously discussed in NIFI-378, and the 
outcome was to update the documentation for fragment.index [1]: 
{quote} Applicable only if the <Merge Strategy> property is set to Defragment. 
This attribute indicates the order in which the fragments should be assembled. 
This attribute must be present on all FlowFiles when using the Defragment Merge 
Strategy and must be a unique (i.e., unique across all FlowFiles that have the 
same value for the "fragment.identifier" attribute) integer between 0 and the 
value of the fragment.count attribute. If two or more FlowFiles have the same 
value for the "fragment.identifier" attribute and the same value for the 
"fragment.index" attribute, the behavior of this Processor is undefined. 
{quote}
I believe this could (and probably should) be improved upon.  Specifically, the 
discussion around NIFI-378 focused on the "improper" use of MergeContent, in 
using the same fragment.identifier to "pair up" files.  The situation I've 
encountered isn't really unusual in any way...

I have a file, being split and sent via PostHTTP to another NiFi instance.  If 
something "goes wrong", the sending NiFi may not get an acknowledgement of 
success even if the file made it to the receiving NiFi.  It then sends the 
segment again.  NiFi favors duplication over loss, so this is not unexpected.  
However, I now have a file broken into X fragments arriving on the other side 
as X+1 (or more).  The reassembly may work... or both copies of a duplicated 
fragment may be chosen, resulting in an incorrectly reassembled file.

To satisfy the contract as it exists, you would need to use a DetectDuplicate 
before the MergeContent to filter these out.  However, that could potentially 
incur a great deal of overhead.  In contrast, simply checking that there are no 
duplicate fragment.index values in a bin should be relatively straightforward.  
How to handle duplicates is a legitimate question... are they ignored, or are 
they discarded (if they're actually the same)?  If fragments with the same 
index aren't identical, what is the behavior?  Personally, I would say if you 
have actual duplicates, drop one and continue with the merge... if you have 
unequal "duplicates", fail the bin.  But there's room for discussion there.

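As a rough illustration of the check I have in mind, here is a minimal sketch.  
It is not MergeContent code: FragmentDedup, Fragment and dedupe are 
hypothetical names, and a fragment is modeled as just a fragment.index plus its 
content held in memory.  A real implementation would presumably compare 
content hashes rather than raw bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of NiFi.
final class FragmentDedup {

    /** Stand-in for a FlowFile: its fragment.index value and its content. */
    static final class Fragment {
        final int index;
        final byte[] content;

        Fragment(int index, byte[] content) {
            this.index = index;
            this.content = content;
        }
    }

    /**
     * Returns the fragments of one bin ordered by index, with exact
     * duplicates dropped. Throws if two fragments share an index but
     * differ in content (the "fail the bin" case).
     */
    static List<Fragment> dedupe(List<Fragment> bin) {
        // TreeMap keeps the surviving fragments sorted by index.
        Map<Integer, Fragment> byIndex = new TreeMap<>();
        for (Fragment candidate : bin) {
            Fragment existing = byIndex.get(candidate.index);
            if (existing == null) {
                byIndex.put(candidate.index, candidate);
            } else if (!Arrays.equals(existing.content, candidate.content)) {
                throw new IllegalStateException(
                        "Conflicting content for fragment.index " + candidate.index);
            }
            // Otherwise it is an exact duplicate: keep the first copy, drop this one.
        }
        return new ArrayList<>(byIndex.values());
    }
}
```

With this, a bin holding indexes 0, 1, 1 (where both copies of 1 are 
byte-identical) merges as 0, 1, while a bin holding two different fragments 
that both claim index 1 is rejected instead of producing a corrupt file.
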
The point is, in this circumstance it is very easy for a user to do a very 
reasonable thing and end up with a corrupt file for reasons that are somewhat 
esoteric.  Then, we would need to explain to them why "defragment" doesn't 
actually defragment, but just kind of sorts a bin of matching things.  I think 
we can do better than that.


 [1] 
[http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.MergeContent/index.html]
  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
