Joe and Brandon,

Thanks for your input here. I agree that changing the behavior of an existing 
processor (that is used in people’s flows) is a breaking change and probably 
requires a major release, which is why I didn’t do that in the PRs. As written 
today, they are fully backward-compatible. My concern is that users have had 
issues because they attempt to deploy HashAttribute expecting it to perform 
hashing of individual attributes. The introduction of 
CryptographicHashAttribute provides this functionality but the 
discoverability/“do what the name says” issue feels to me like a new addition 
to the “death by 1000 paper cuts” list that any complex project like NiFi has 
to endure.

The CryptographicHashContent processor is fully backward-compatible on 
(expected) functionality, but because of the property descriptor naming, it’s 
not “in-place” replaceable. Instead, (legacy) HashContent is marked as 
deprecated, and moving forward, CHC should be used.

Given your well-described use cases for HA, I think I may be able to provide 
that in CHA as well. I would expect to add a dropdown PD for “attribute 
enumeration style” and offer “individual” (each hash is generated on a single 
attribute value), “list” (each hash is generated over the values of an 
ordered, delimited list of literally matched attribute names), and “regex” 
(each hash is generated over the values of all attributes whose names match 
the provided regex, ordered by attribute name). Then the dynamic properties 
would describe the output, as happens in the existing PR. Maybe a custom 
delimiter property is needed too, but for now the empty string (‘’) could be 
used to join the values. I’ll write up a Jira for this, and hopefully you can 
both let me know if this meets your requirements.

Example:

*Incoming Flowfile*

attributes: [username: “alopresto”, role: “security”, email: 
“[email protected]”, git_account: “alopresto”]

*CHA Properties (Individual)*

attribute_enumeration_style: “individual”
(dynamic) username_sha256: “username”
(dynamic) git_account_sha256: “git_account”

*Behavior (Individual)*

username_sha256 = git_account_sha256 = $(echo -n "alopresto" | shasum -a 256) = 
600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23

*Resulting Flowfile (Individual)*

attributes: [username: “alopresto”, role: “security”, email: 
“[email protected]”, git_account: “alopresto”, 
username_sha256: 
“600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23”, 
git_account_sha256: 
“600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23”]

*CHA Properties (List)*

attribute_enumeration_style: “list”
(dynamic) username_and_email_sha256: “username, email”
(dynamic) git_account_sha256: “git_account”

*Behavior (List)*

username_and_email_sha256 = $(echo -n "[email protected]" | shasum 
-a 256) = 22a11b7b3173f95c23a1f434949ec2a2e66455b9cb26b7ebc90afca25d91333f
git_account_sha256 = $(echo -n "alopresto" | shasum -a 256) = 
600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23

*Resulting Flowfile (List)*

attributes: [username: “alopresto”, role: “security”, email: 
“[email protected]”, git_account: “alopresto”, 
username_and_email_sha256: 
“22a11b7b3173f95c23a1f434949ec2a2e66455b9cb26b7ebc90afca25d91333f”, 
git_account_sha256: 
“600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23”]

*CHA Properties (Regex)*

attribute_enumeration_style: “regex”
(dynamic) all_sha256: “.*”
(dynamic) git_account_sha256: “git_account”

*Behavior (Regex)*

all_sha256 = sort(attributes_that_match_regex) = [email, git_account, role, 
username] = $(echo -n "[email protected]" | shasum 
-a 256) = b370fdf0132933cea76e3daa3d4a437bb8c571dd0cd0e79ee5d7759cf64efced
git_account_sha256 = $(echo -n "alopresto" | shasum -a 256) = 
600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23

*Resulting Flowfile (Regex)*

attributes: [username: “alopresto”, role: “security”, email: 
“[email protected]”, git_account: “alopresto”, 
all_sha256: 
“b370fdf0132933cea76e3daa3d4a437bb8c571dd0cd0e79ee5d7759cf64efced”, 
git_account_sha256: 
“600973dc8f2b7bb2a20651ebefe4bf91c5295afef19f4d5b9994d581f5a68a23”]
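
For illustration only, the three enumeration styles above could be sketched in Python roughly as follows. This is not the code in the PRs. The function name hash_attributes, splitting the “list” value on commas, skipping attribute names that are missing from the flowfile, and requiring the regex to match the whole attribute name are all assumptions made for the sketch.

```python
import hashlib
import re


def sha256_hex(value: str) -> str:
    """Hex SHA-256 of the UTF-8 bytes, like `echo -n value | shasum -a 256`."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_attributes(attributes, style, dynamic_properties, delimiter=""):
    """Return a copy of `attributes` with one hash attribute added per
    dynamic property (hypothetical helper, not the NiFi implementation).

    style: "individual", "list", or "regex"
    dynamic_properties: {output attribute name: attribute spec}
    delimiter: string used to join the selected values before hashing
    """
    result = dict(attributes)
    for output_name, spec in dynamic_properties.items():
        if style == "individual":
            # The spec is exactly one attribute name.
            names = [spec]
        elif style == "list":
            # The spec is an ordered, comma-delimited list of literal names.
            names = [name.strip() for name in spec.split(",")]
        elif style == "regex":
            # The spec is a regex; matching names are sorted for a stable order.
            pattern = re.compile(spec)
            names = sorted(n for n in attributes if pattern.fullmatch(n))
        else:
            raise ValueError(f"unknown enumeration style: {style}")
        # Assumption: names not present on the flowfile are skipped.
        values = [attributes[name] for name in names if name in attributes]
        result[output_name] = sha256_hex(delimiter.join(values))
    return result
```

Matching is done against the incoming attributes only, so a hash produced by one dynamic property never feeds into another. Called with the “individual” properties from the example above, username_sha256 and git_account_sha256 come out identical because both source values are “alopresto”.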

Mike,

I don’t think it makes sense to remove this functionality and relocate it to 
Elasticsearch. Adding a capacity to calculate a unique identifier over multiple 
inputs may be valuable for Elasticsearch specifically, but the functionality 
described here is independent from that use case, and as NiFi’s processor 
design philosophy is similar to *nix builtins (do one thing; do it well), it 
makes more sense to chain the individual necessary processors.


Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Sep 5, 2018, at 9:33 AM, Brandon DeVries <[email protected]> wrote:
> 
> Mike,
> 
> We don't use it with Elasticsearch.
> 
> Fundamentally, it feels like the problem is that this change would break 
> backwards compatibility, which would require a major version bump.  So, in 
> lieu of that, the options are probably 1) use a different name or 2) put the 
> new functionality in HashContent as something that can be toggled on, but 
> leaving the current behavior as the default.
> 
> Brandon
> 
> On Wed, Sep 5, 2018 at 12:21 PM Mike Thomsen <[email protected]> wrote:
> Brandon,
> 
> What processor do you use it for in that capacity? If it's an ElasticSearch 
> one we can look into ways to bring this functionality into that bundle so 
> Andy can refactor.
> 
> Thanks,
> 
> Mike
> 
> On Wed, Sep 5, 2018 at 12:07 PM Brandon DeVries <[email protected]> wrote:
> Andy,
> 
> We use it pretty much how Joe is... to create a unique composite key.  It 
> seems as though that shouldn't be a difficult functionality to add.  
> Possibly, you could flip your current dynamic key/value properties.  Make the 
> key the name of the attribute you want to create, and the value is the 
> attribute / attributes (newline delimited) that you want to include in the 
> hash.  This does mean you can't use "${algorithm.name}" in the name of the 
> created hash attribute, but I don't know if you'd consider that a big loss.  
> In any case, I'm sure there are other solutions, this is just a thought.
> 
> Brandon
> 
> On Wed, Sep 5, 2018 at 10:27 AM Joe Percivall <[email protected]> wrote:
> Hey Andy,
> 
> We're currently using the HashAttribute processor. The use-case is that we 
> have various events that come in but sometimes those events are just updates 
> of previous ones. We store everything in ElasticSearch. So for certain 
> events, we'll calculate a hash based on a couple of attributes in order to 
> have a composite unique key to upsert as the ES _id. This allows us to easily 
> just insert/update events that are the same (as determined by the hashed 
> composite key).
> 
> As for the configuration of the processors, we're essentially just specifying 
> exact attributes as dynamic properties of HashAttribute. Then passing that FF 
> to PutElasticSearchHttp with the resulting attribute from HashAttribute as 
> the "Identifier Attribute".
> 
> Joe
> 
> On Mon, Sep 3, 2018 at 9:52 PM Andy LoPresto <[email protected]> wrote:
> I opened PRs for 2980 [1] and 2983 [2] which add more performant, consistent, 
> and full-featured processors to calculate cryptographic hashes of flowfile 
> content and flowfile attributes. I would like to deprecate and drop support 
> for HashAttribute, as it performs a convoluted calculation that was probably 
> useful in an old scenario, but doesn’t “hash attributes” like the name 
> implies. As it blocks the new implementation from using that name and 
> following our naming convention, I am hoping to find anyone still using the 
> old implementation and understand their use case. Thanks for your help.
> 
> [1] https://github.com/apache/nifi/pull/2980
> [2] https://github.com/apache/nifi/pull/2983
> 
> 
> 
> Andy LoPresto
> [email protected]
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
> 
> 
> 
> --
> Joe Percivall
> linkedin.com/in/Percivall
> e: [email protected]
