Hi Mark and Joe

Yesterday morning I implemented Mark's script in my two test flows. One test flow uses SFTP; the other uses MergeContent/UnpackContent. Both test flows are running on a test cluster with 3 nodes and NiFi 1.14.0.
The first flow (SFTP) has had 1 file go into the failure queue, after about 16 hours. The second flow has had 2 files go into the failure queue, after about 15 and 17 hours.
There is definitely something going wrong in my setup, but I can't figure out what.

Information from file 1:
histogram.0;0 histogram.1;0 histogram.10;11926720 histogram.100;11927504 histogram.101;11925396 histogram.102;11929923 histogram.103;11931596 histogram.104;11929071 histogram.105;11931365 histogram.106;11928661 histogram.107;11929864 histogram.108;11931611 histogram.109;11932758 histogram.11;0 histogram.110;11927893 histogram.111;11933519 histogram.112;11931392 histogram.113;11928534 histogram.114;11936879 histogram.115;11932818 histogram.116;11934767 histogram.117;11929143 histogram.118;11931854 histogram.119;11926333 histogram.12;0 histogram.120;11928731 histogram.121;11931149 histogram.122;11926725 histogram.123;0 histogram.124;0 histogram.125;0 histogram.126;0 histogram.127;0 histogram.128;0 histogram.129;0 histogram.13;0 histogram.130;0 histogram.131;0 histogram.132;0 histogram.133;0 histogram.134;0 histogram.135;0 histogram.136;0 histogram.137;0 histogram.138;0 histogram.139;0 histogram.14;0 histogram.140;0 histogram.141;0 histogram.142;0 histogram.143;0 histogram.144;0 histogram.145;0 histogram.146;0 histogram.147;0 histogram.148;0 histogram.149;0 histogram.15;0 histogram.150;0 histogram.151;0 histogram.152;0 histogram.153;0 histogram.154;0 histogram.155;0 histogram.156;0 histogram.157;0 histogram.158;0 histogram.159;0 histogram.16;0 histogram.160;0 histogram.161;0 histogram.162;0 histogram.163;0 histogram.164;0 histogram.165;0 histogram.166;0 histogram.167;0 histogram.168;0 histogram.169;0 histogram.17;0 histogram.170;0 histogram.171;0 histogram.172;0 histogram.173;0 histogram.174;0 histogram.175;0 histogram.176;0 histogram.177;0 histogram.178;0 histogram.179;0 histogram.18;0 histogram.180;0 histogram.181;0 histogram.182;0 histogram.183;0 histogram.184;0 histogram.185;0 histogram.186;0 histogram.187;0 histogram.188;0 histogram.189;0 histogram.19;0 histogram.190;0 histogram.191;0 histogram.192;0 histogram.193;0 histogram.194;0 histogram.195;0 
histogram.196;0 histogram.197;0 histogram.198;0 histogram.199;0 histogram.2;0 histogram.20;0 histogram.200;0 histogram.201;0 histogram.202;0 histogram.203;0 histogram.204;0 histogram.205;0 histogram.206;0 histogram.207;0 histogram.208;0 histogram.209;0 histogram.21;0 histogram.210;0 histogram.211;0 histogram.212;0 histogram.213;0 histogram.214;0 histogram.215;0 histogram.216;0 histogram.217;0 histogram.218;0 histogram.219;0 histogram.22;0 histogram.220;0 histogram.221;0 histogram.222;0 histogram.223;0 histogram.224;0 histogram.225;0 histogram.226;0 histogram.227;0 histogram.228;0 histogram.229;0 histogram.23;0 histogram.230;0 histogram.231;0 histogram.232;0 histogram.233;0 histogram.234;0 histogram.235;0 histogram.236;0 histogram.237;0 histogram.238;0 histogram.239;0 histogram.24;0 histogram.240;0 histogram.241;0 histogram.242;0 histogram.243;0 histogram.244;0 histogram.245;0 histogram.246;0 histogram.247;0 histogram.248;0 histogram.249;0 histogram.25;0 histogram.250;0 histogram.251;0 histogram.252;0 histogram.253;0 histogram.254;0 histogram.255;0 histogram.26;0 histogram.27;0 histogram.28;0 histogram.29;0 histogram.3;0 histogram.30;0 histogram.31;0 histogram.32;11930422 histogram.33;11934311 histogram.34;11930459 histogram.35;11924776 histogram.36;11924186 histogram.37;11928616 histogram.38;11929474 histogram.39;11929607 histogram.4;0 histogram.40;11928053 histogram.41;11930402 histogram.42;11926830 histogram.43;11938138 histogram.44;11932536 histogram.45;11931053 histogram.46;11930008 histogram.47;11927747 histogram.48;11936055 histogram.49;11931471 histogram.5;0 histogram.50;11931921 histogram.51;11929643 histogram.52;11923847 histogram.53;11927311 histogram.54;11933754 histogram.55;11925964 histogram.56;11928872 histogram.57;11931124 histogram.58;11928474 histogram.59;11925814 histogram.6;0 histogram.60;11933978 histogram.61;11934136 histogram.62;11932016 histogram.63;23864588 histogram.64;11924792 histogram.65;11934789 histogram.66;11933047 
histogram.67;11931899 histogram.68;11935615 histogram.69;11927249 histogram.7;0 histogram.70;11933276 histogram.71;11927953 histogram.72;11929275 histogram.73;11930292 histogram.74;11935428 histogram.75;11930317 histogram.76;11935737 histogram.77;11932127 histogram.78;11932344 histogram.79;11932094 histogram.8;0 histogram.80;11930688 histogram.81;11928415 histogram.82;11931559 histogram.83;11934192 histogram.84;11927224 histogram.85;11929491 histogram.86;11930624 histogram.87;11932201 histogram.88;11930694 histogram.89;11936439 histogram.9;11933187 histogram.90;11926445 histogram.91;0 histogram.92;0 histogram.93;0 histogram.94;11931596 histogram.95;11929379 histogram.96;0 histogram.97;11928864 histogram.98;11924738 histogram.99;11930062 histogram.totalBytes;1073741824 File 2: histogram.0;0 histogram.1;0 histogram.10;11932402 histogram.100;11927531 histogram.101;11928454 histogram.102;11934432 histogram.103;11924623 histogram.104;11934492 histogram.105;11934585 histogram.106;11928955 histogram.107;11928651 histogram.108;11930139 histogram.109;11929325 histogram.11;0 histogram.110;11930486 histogram.111;11933517 histogram.112;11928334 histogram.113;11927798 histogram.114;11929222 histogram.115;11932057 histogram.116;11931182 histogram.117;11933407 histogram.118;11932709 histogram.119;11931338 histogram.12;0 histogram.120;11933700 histogram.121;11929803 histogram.122;11930218 histogram.123;0 histogram.124;0 histogram.125;0 histogram.126;0 histogram.127;0 histogram.128;0 histogram.129;0 histogram.13;0 histogram.130;0 histogram.131;0 histogram.132;0 histogram.133;0 histogram.134;0 histogram.135;0 histogram.136;0 histogram.137;0 histogram.138;0 histogram.139;0 histogram.14;0 histogram.140;0 histogram.141;0 histogram.142;0 histogram.143;0 histogram.144;0 histogram.145;0 histogram.146;0 histogram.147;0 histogram.148;0 histogram.149;0 histogram.15;0 histogram.150;0 histogram.151;0 histogram.152;0 histogram.153;0 histogram.154;0 histogram.155;0 histogram.156;0 
histogram.157;0 histogram.158;0 histogram.159;0 histogram.16;0 histogram.160;0 histogram.161;0 histogram.162;0 histogram.163;0 histogram.164;0 histogram.165;0 histogram.166;0 histogram.167;0 histogram.168;0 histogram.169;0 histogram.17;0 histogram.170;0 histogram.171;0 histogram.172;0 histogram.173;0 histogram.174;0 histogram.175;0 histogram.176;0 histogram.177;0 histogram.178;0 histogram.179;0 histogram.18;0 histogram.180;0 histogram.181;0 histogram.182;0 histogram.183;0 histogram.184;0 histogram.185;0 histogram.186;0 histogram.187;0 histogram.188;0 histogram.189;0 histogram.19;0 histogram.190;0 histogram.191;0 histogram.192;0 histogram.193;0 histogram.194;0 histogram.195;0 histogram.196;0 histogram.197;0 histogram.198;0 histogram.199;0 histogram.2;0 histogram.20;0 histogram.200;0 histogram.201;0 histogram.202;0 histogram.203;0 histogram.204;0 histogram.205;0 histogram.206;0 histogram.207;0 histogram.208;0 histogram.209;0 histogram.21;0 histogram.210;0 histogram.211;0 histogram.212;0 histogram.213;0 histogram.214;0 histogram.215;0 histogram.216;0 histogram.217;0 histogram.218;0 histogram.219;0 histogram.22;0 histogram.220;0 histogram.221;0 histogram.222;0 histogram.223;0 histogram.224;0 histogram.225;0 histogram.226;0 histogram.227;0 histogram.228;0 histogram.229;0 histogram.23;0 histogram.230;0 histogram.231;0 histogram.232;0 histogram.233;0 histogram.234;0 histogram.235;0 histogram.236;0 histogram.237;0 histogram.238;0 histogram.239;0 histogram.24;0 histogram.240;0 histogram.241;0 histogram.242;0 histogram.243;0 histogram.244;0 histogram.245;0 histogram.246;0 histogram.247;0 histogram.248;0 histogram.249;0 histogram.25;0 histogram.250;0 histogram.251;0 histogram.252;0 histogram.253;0 histogram.254;0 histogram.255;0 histogram.26;0 histogram.27;0 histogram.28;0 histogram.29;0 histogram.3;0 histogram.30;0 histogram.31;0 histogram.32;11924458 histogram.33;11934243 histogram.34;11930696 histogram.35;11925574 histogram.36;11929198 histogram.37;11928146 
histogram.38;11932505 histogram.39;11929406 histogram.4;0 histogram.40;11930100 histogram.41;11930867 histogram.42;11930796 histogram.43;11930796 histogram.44;11921866 histogram.45;11935682 histogram.46;11930075 histogram.47;11928169 histogram.48;11933490 histogram.49;11932174 histogram.5;0 histogram.50;11933255 histogram.51;11934009 histogram.52;11928361 histogram.53;11927626 histogram.54;11931611 histogram.55;11930755 histogram.56;11933823 histogram.57;11922508 histogram.58;11930384 histogram.59;11929805 histogram.6;0 histogram.60;11930064 histogram.61;11926761 histogram.62;11927605 histogram.63;23858926 histogram.64;11929516 histogram.65;11930217 histogram.66;11930478 histogram.67;11939855 histogram.68;11927850 histogram.69;11931154 histogram.7;0 histogram.70;11935374 histogram.71;11930754 histogram.72;11928304 histogram.73;11931772 histogram.74;11939417 histogram.75;11930712 histogram.76;11933331 histogram.77;11931279 histogram.78;11928276 histogram.79;11930071 histogram.8;0 histogram.80;11927830 histogram.81;11931213 histogram.82;11930964 histogram.83;11928973 histogram.84;11934325 histogram.85;11929658 histogram.86;11924667 histogram.87;11931100 histogram.88;11930252 histogram.89;11927281 histogram.9;11932848 histogram.90;11930398 histogram.91;0 histogram.92;0 histogram.93;0 histogram.94;11928720 histogram.95;11928988 histogram.96;0 histogram.97;11931423 histogram.98;11928181 histogram.99;11935549 histogram.totalBytes;1073741824 File3: histogram.0;0 histogram.1;0 histogram.10;11930417 histogram.100;11926739 histogram.101;11930580 histogram.102;11928210 histogram.103;11935300 histogram.104;11925804 histogram.105;11931023 histogram.106;11932342 histogram.107;11929778 histogram.108;11930098 histogram.109;11930759 histogram.11;0 histogram.110;11934343 histogram.111;11935775 histogram.112;11933877 histogram.113;11926675 histogram.114;11929332 histogram.115;11928876 histogram.116;11927819 histogram.117;11932657 histogram.118;11933508 histogram.119;11928808 
histogram.12;0 histogram.120;11937532 histogram.121;11926907 histogram.122;11933942 histogram.123;0 histogram.124;0 histogram.125;0 histogram.126;0 histogram.127;0 histogram.128;0 histogram.129;0 histogram.13;0 histogram.130;0 histogram.131;0 histogram.132;0 histogram.133;0 histogram.134;0 histogram.135;0 histogram.136;0 histogram.137;0 histogram.138;0 histogram.139;0 histogram.14;0 histogram.140;0 histogram.141;0 histogram.142;0 histogram.143;0 histogram.144;0 histogram.145;0 histogram.146;0 histogram.147;0 histogram.148;0 histogram.149;0 histogram.15;0 histogram.150;0 histogram.151;0 histogram.152;0 histogram.153;0 histogram.154;0 histogram.155;0 histogram.156;0 histogram.157;0 histogram.158;0 histogram.159;0 histogram.16;0 histogram.160;0 histogram.161;0 histogram.162;0 histogram.163;0 histogram.164;0 histogram.165;0 histogram.166;0 histogram.167;0 histogram.168;0 histogram.169;0 histogram.17;0 histogram.170;0 histogram.171;0 histogram.172;0 histogram.173;0 histogram.174;0 histogram.175;0 histogram.176;0 histogram.177;0 histogram.178;0 histogram.179;0 histogram.18;0 histogram.180;0 histogram.181;0 histogram.182;0 histogram.183;0 histogram.184;0 histogram.185;0 histogram.186;0 histogram.187;0 histogram.188;0 histogram.189;0 histogram.19;0 histogram.190;0 histogram.191;0 histogram.192;0 histogram.193;0 histogram.194;0 histogram.195;0 histogram.196;0 histogram.197;0 histogram.198;0 histogram.199;0 histogram.2;0 histogram.20;0 histogram.200;0 histogram.201;0 histogram.202;0 histogram.203;0 histogram.204;0 histogram.205;0 histogram.206;0 histogram.207;0 histogram.208;0 histogram.209;0 histogram.21;0 histogram.210;0 histogram.211;0 histogram.212;0 histogram.213;0 histogram.214;0 histogram.215;0 histogram.216;0 histogram.217;0 histogram.218;0 histogram.219;0 histogram.22;0 histogram.220;0 histogram.221;0 histogram.222;0 histogram.223;0 histogram.224;0 histogram.225;0 histogram.226;0 histogram.227;0 histogram.228;0 histogram.229;0 histogram.23;0 histogram.230;0 
histogram.231;0 histogram.232;0 histogram.233;0 histogram.234;0 histogram.235;0 histogram.236;0 histogram.237;0 histogram.238;0 histogram.239;0 histogram.24;0 histogram.240;0 histogram.241;0 histogram.242;0 histogram.243;0 histogram.244;0 histogram.245;0 histogram.246;0 histogram.247;0 histogram.248;0 histogram.249;0 histogram.25;0 histogram.250;0 histogram.251;0 histogram.252;0 histogram.253;0 histogram.254;0 histogram.255;0 histogram.26;0 histogram.27;0 histogram.28;0 histogram.29;0 histogram.3;0 histogram.30;0 histogram.31;0 histogram.32;11929486 histogram.33;11930737 histogram.34;11931092 histogram.35;11934488 histogram.36;11927605 histogram.37;11930735 histogram.38;11932174 histogram.39;11936180 histogram.4;0 histogram.40;11931666 histogram.41;11927043 histogram.42;11929044 histogram.43;11934104 histogram.44;11936337 histogram.45;11935580 histogram.46;11929598 histogram.47;11934083 histogram.48;11928858 histogram.49;11931098 histogram.5;0 histogram.50;11930618 histogram.51;11925429 histogram.52;11929741 histogram.53;11934160 histogram.54;11931999 histogram.55;11930465 histogram.56;11926194 histogram.57;11926386 histogram.58;11924871 histogram.59;11929331 histogram.6;0 histogram.60;11926951 histogram.61;11928631 histogram.62;11927549 histogram.63;23856730 histogram.64;11930288 histogram.65;11931523 histogram.66;11932821 histogram.67;11932509 histogram.68;11929613 histogram.69;11928651 histogram.7;0 histogram.70;11929253 histogram.71;11931521 histogram.72;11925805 histogram.73;11934833 histogram.74;11928314 histogram.75;11923854 histogram.76;11930892 histogram.77;11927528 histogram.78;11932850 histogram.79;11934471 histogram.8;0 histogram.80;11925707 histogram.81;11929213 histogram.82;11931334 histogram.83;11936739 histogram.84;11927855 histogram.85;11931668 histogram.86;11928609 histogram.87;11931930 histogram.88;11934341 histogram.89;11927519 histogram.9;11928004 histogram.90;11933502 histogram.91;0 histogram.92;0 histogram.93;0 histogram.94;11932024 
histogram.95;11932693 histogram.96;0 histogram.97;11928428 histogram.98;11933195 histogram.99;11924273 histogram.totalBytes;1073741824

Kind regards
Jens

On Sun, Oct 31, 2021 at 21:40, Joe Witt <[email protected]> wrote:
> Jens
>
> 118 hours in - still good.
>
> Thanks
>
> On Fri, Oct 29, 2021 at 10:22 AM Joe Witt <[email protected]> wrote:
> >
> > Jens
> >
> > Update from hour 67. Still lookin' good.
> >
> > Will advise.
> >
> > Thanks
> >
> > On Thu, Oct 28, 2021 at 8:08 AM Jens M. Kofoed <[email protected]> wrote:
> > >
> > > Many many thanks 🙏 Joe for looking into this. My test flow was running for 6 days before the first error occurred
> > >
> > > Thanks
> > >
> > > > On Oct 28, 2021, at 16:57, Joe Witt <[email protected]> wrote:
> > > >
> > > > Jens,
> > > >
> > > > Am 40+ hours in running both your flow and mine to reproduce. So far
> > > > neither have shown any sign of trouble. Will keep running for another
> > > > week or so if I can.
> > > >
> > > > Thanks
> > > >
> > > >> On Wed, Oct 27, 2021 at 12:42 PM Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> The physical hosts with VMware use VMFS, but the VMs running on the hosts can’t see that.
> > > >> But you asked about the underlying file system 😀 and since my first answer with the copy from the fstab file wasn’t enough I just wanted to give all the details 😁.
> > > >>
> > > >> If you create a VM for Windows you would probably use NTFS (on top of VMFS). For Linux: EXT3, EXT4, BTRFS, XFS and so on.
> > > >>
> > > >> All the partitions on my NiFi nodes are local devices (sda, sdb, sdc and sdd) for each Linux machine. I don’t use NFS.
> > > >>
> > > >> Kind regards
> > > >> Jens
> > > >>
> > > >> On Oct 27, 2021, at 17:47, Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> I don't quite follow the EXT4 usage on top of VMFS but the point here
> > > >> is you'll ultimately need to truly understand your underlying storage
> > > >> system and what sorts of guarantees it is giving you. If linux/the
> > > >> jvm/nifi think it has a typical EXT4 type block storage system to work
> > > >> with it can only be safe/operate within those constraints. I have no
> > > >> idea about what VMFS brings to the table or the settings for it.
> > > >>
> > > >> The sync properties I shared previously might help force the issue of
> > > >> ensuring a formal sync/flush cycle all the way through the disk has
> > > >> occurred which we'd normally not do or need to do but again in some
> > > >> cases offers a stronger guarantee in exchange for performance.
> > > >>
> > > >> In any case...Mark's path for you here will help identify what we're
> > > >> dealing with and we can go from there.
> > > >>
> > > >> I am aware of significant usage of NiFi on VMWare configurations
> > > >> without issue at high rates for many years so whatever it is here is
> > > >> likely solvable.
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Wed, Oct 27, 2021 at 7:28 AM Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Hi Mark
> > > >>
> > > >> Thanks for the clarification. I will implement the script when I return to the office on Monday next week (November 1st).
> > > >>
> > > >> I don’t use NFS, but ext4. But I will implement the script so we can check if it’s the case here. But I think the issue might occur after the processors write content to the repository.
> > > >>
> > > >> I have a test flow that has been running for more than 2 weeks without any errors. But this flow only calculates the hash and compares.
> > > >>
> > > >> Two other flows both create errors. One flow uses PutSFTP->FetchSFTP->CryptographicHashContent->compares.
> > > >> The other flow uses MergeContent->UnpackContent->CryptographicHashContent->compares. The last flow is totally inside NiFi, excluding other network/server issues.
> > > >>
> > > >> In both cases the CryptographicHashContent is right after a processor which writes new content to the repository. But in one case a file in our production flow calculated a wrong hash 4 times, with a 1-minute delay between each calculation. A few hours later I looped the file back and this time it was OK.
> > > >>
> > > >> Just like the case in steps 5 and 12 in the PDF file.
> > > >>
> > > >> I will let you all know more later next week.
> > > >>
> > > >> Kind regards
> > > >> Jens
> > > >>
> > > >> On Oct 27, 2021, at 15:43, Mark Payne <[email protected]> wrote:
> > > >>
> > > >> And the actual script:
> > > >>
> > > >> import org.apache.nifi.flowfile.FlowFile
> > > >> import java.util.stream.Collectors
> > > >>
> > > >> // Collect the histogram.* attributes left by an earlier pass, if any.
> > > >> Map<String, String> getPreviousHistogram(final FlowFile flowFile) {
> > > >>     final Map<String, String> histogram = flowFile.getAttributes().entrySet().stream()
> > > >>         .filter({ entry -> entry.getKey().startsWith("histogram.") })
> > > >>         .collect(Collectors.toMap({ entry -> entry.key }, { entry -> entry.value }))
> > > >>     return histogram;
> > > >> }
> > > >>
> > > >> // Count how often each byte value (0-255) occurs in the content.
> > > >> Map<String, String> createHistogram(final FlowFile flowFile, final InputStream inStream) {
> > > >>     final Map<String, String> histogram = new HashMap<>();
> > > >>     final int[] distribution = new int[256];
> > > >>     Arrays.fill(distribution, 0);
> > > >>
> > > >>     long total = 0L;
> > > >>     final byte[] buffer = new byte[8192];
> > > >>     int len;
> > > >>     while ((len = inStream.read(buffer)) > 0) {
> > > >>         for (int i=0; i < len; i++) {
> > > >>             // JVM bytes are signed; mask so the index is always 0-255.
> > > >>             final int val = buffer[i] & 0xFF;
> > > >>             distribution[val]++;
> > > >>             total++;
> > > >>         }
> > > >>     }
> > > >>
> > > >>     for (int i=0; i < 256; i++) {
> > > >>         histogram.put("histogram." + i, String.valueOf(distribution[i]));
> > > >>     }
> > > >>     histogram.put("histogram.totalBytes", String.valueOf(total));
> > > >>
> > > >>     return histogram;
> > > >> }
> > > >>
> > > >> // Log every histogram entry whose count changed between the two passes.
> > > >> void logHistogramDifferences(final Map<String, String> previous, final Map<String, String> updated) {
> > > >>     final StringBuilder sb = new StringBuilder("There are differences in the histogram\n");
> > > >>     final Map<String, String> sorted = new TreeMap<>(previous)
> > > >>     for (final Map.Entry<String, String> entry : sorted.entrySet()) {
> > > >>         final String key = entry.getKey();
> > > >>         final String previousValue = entry.getValue();
> > > >>         final String updatedValue = updated.get(entry.getKey())
> > > >>
> > > >>         if (!Objects.equals(previousValue, updatedValue)) {
> > > >>             sb.append("Byte Value: ").append(key).append(", Previous Count: ").append(previousValue).append(", New Count: ").append(updatedValue).append("\n");
> > > >>         }
> > > >>     }
> > > >>
> > > >>     log.error(sb.toString());
> > > >> }
> > > >>
> > > >>
> > > >> def flowFile = session.get()
> > > >> if (flowFile == null) {
> > > >>     return
> > > >> }
> > > >>
> > > >> final Map<String, String> previousHistogram = getPreviousHistogram(flowFile)
> > > >> Map<String, String> histogram = null;
> > > >>
> > > >> final InputStream inStream = session.read(flowFile);
> > > >> try {
> > > >>     histogram = createHistogram(flowFile, inStream);
> > > >> } finally {
> > > >>     inStream.close()
> > > >> }
> > > >>
> > > >> // On a repeat pass, compare against the stored histogram; route to failure on mismatch.
> > > >> if (!previousHistogram.isEmpty()) {
> > > >>     if (previousHistogram.equals(histogram)) {
> > > >>         log.info("Histograms match")
> > > >>     } else {
> > > >>         logHistogramDifferences(previousHistogram, histogram)
> > > >>         session.transfer(flowFile, REL_FAILURE)
> > > >>         return;
> > > >>     }
> > > >> }
> > > >>
> > > >> flowFile = session.putAllAttributes(flowFile, histogram)
> > > >> session.transfer(flowFile, REL_SUCCESS)
> > > >>
> > > >>
> > > >> On Oct 27, 2021, at 9:43 AM, Mark Payne <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> For a bit of background here, the reason that Joe and I have expressed interest in NFS file systems is that the way the protocol works, it is allowed to receive packets/chunks of the file out-of-order. So, what happens is let’s say a 1 MB file is being written. The first 500 KB are received. Then instead of the 501st KB it receives the 503rd KB. What happens is that the size of the file on the file system becomes 503 KB. But what about 501 & 502? Well when you read the data, the file system just returns ASCII NUL characters (byte 0) for those bytes. Once the NFS server receives those bytes, it then goes back and fills in the proper bytes. So if you’re running on NFS, it is possible for the contents of the file on the underlying file system to change out from under you. It’s not clear to me what other types of file system might do something similar.
> > > >>
> > > >> So, one thing that we can do is to find out whether or not the contents of the underlying file have changed in some way, or if there’s something else happening that could perhaps result in the hashes being wrong. I’ve put together a script that should help diagnose this.
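The out-of-order effect described above can be imitated on an ordinary local file. The following is only a rough Python sketch, not NFS itself: it assumes that writing past the current end of a file leaves a hole that reads back as NUL bytes (the usual behaviour on common local file systems). The helper names, chunk sizes and payload are invented for the illustration.

```python
import hashlib
import os
import tempfile

KB = 1024


def sha256_of(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_chunk(path, offset, data):
    """Write `data` at byte `offset` without touching the rest of the file."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


# Hypothetical 8 KB payload made of printable bytes only (no NUL bytes).
payload = (bytes(range(32, 123)) * 100)[:8 * KB]

fd, path = tempfile.mkstemp()
os.close(fd)

# The first 4 KB arrive in order, then the last 2 KB arrive early,
# leaving a 2 KB hole in the middle of the file.
write_chunk(path, 0, payload[:4 * KB])
write_chunk(path, 6 * KB, payload[6 * KB:])

with open(path, "rb") as f:
    early_view = f.read()
early_hash = sha256_of(path)

# The missing 2 KB finally arrive and fill the hole.
write_chunk(path, 4 * KB, payload[4 * KB:6 * KB])
final_hash = sha256_of(path)
os.remove(path)

gap = early_view[4 * KB:6 * KB]
print("file already full length:", len(early_view) == len(payload))
print("hole reads as NUL bytes:", gap == b"\x00" * (2 * KB))
print("hash changed after fill:", early_hash != final_hash)
```

A reader that hashes the file between the second and third write sees the full length but a different digest, which is the kind of transient mismatch a byte histogram would show as a surplus of byte value 0.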
> > > >>
> > > >> Can you insert an ExecuteScript processor either just before or just after your CryptographicHashContent processor? Doesn’t really matter whether it’s run just before or just after. I’ll attach the script here. It’s a Groovy Script so you should be able to use ExecuteScript with Script Engine = Groovy and the following script as the Script Body. No other changes needed.
> > > >>
> > > >> The way the script works, it reads in the contents of the FlowFile, and then it builds up a histogram of all byte values (0-255) that it sees in the contents, and then adds that as attributes. So it adds attributes such as:
> > > >>
> > > >> histogram.0 = 280273
> > > >> histogram.1 = 2820
> > > >> histogram.2 = 48202
> > > >> histogram.3 = 3820
> > > >> …
> > > >> histogram.totalBytes = 1780928732
> > > >>
> > > >> It then checks if those attributes have already been added. If so, after calculating that histogram, it checks against the previous values (in the attributes). If they are the same, the FlowFile goes to ‘success’. If they are different, it logs an error indicating the before/after value for any byte whose distribution was different, and it routes to failure.
> > > >>
> > > >> So, if for example, the first time through it sees 280,273 bytes with a value of ‘0’, and the second time it only sees 12,001 then we know there were a bunch of 0’s previously that were updated to be some other value. And it includes the total number of bytes in case somehow we find that we’re reading too many bytes or not enough bytes or something like that. This should help narrow down what’s happening.
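As a rough standalone illustration of the comparison described above, here is a small Python sketch that runs outside NiFi. The function names and the two sample byte strings are made up for the example; only the attribute naming (histogram.N and histogram.totalBytes) follows the description.

```python
from collections import Counter


def byte_histogram(data: bytes) -> dict:
    """Count each byte value 0-255 and return the counts as string attributes."""
    counts = Counter(data)  # iterating over bytes yields unsigned ints 0..255
    hist = {f"histogram.{value}": str(counts.get(value, 0)) for value in range(256)}
    hist["histogram.totalBytes"] = str(len(data))
    return hist


def histogram_differences(previous: dict, updated: dict) -> list:
    """Return (key, previous count, new count) for every entry that changed."""
    diffs = []
    for key in sorted(previous):
        if previous[key] != updated.get(key):
            diffs.append((key, previous[key], updated.get(key)))
    return diffs


# Hypothetical content: the first read sees two NUL bytes where "CC" belongs.
original = b"AAAA\x00\x00BB"
later = b"AAAACCBB"

diffs = histogram_differences(byte_histogram(original), byte_histogram(later))
for key, before, after in diffs:
    print(f"Byte Value: {key}, Previous Count: {before}, New Count: {after}")
```

In this made-up case the total byte count is unchanged, and the only differences are byte 0 going from 2 to 0 and byte 67 ("C") going from 0 to 2, which is the signature of NUL bytes later being replaced by real data.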
> > > >>
> > > >> Thanks
> > > >> -Mark
> > > >>
> > > >>
> > > >> On Oct 26, 2021, at 6:25 PM, Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens
> > > >>
> > > >> Attached is the flow I was using (now running yours and this one). Curious if that one reproduces the issue for you as well.
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 26, 2021 at 3:09 PM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens
> > > >>
> > > >> I have your flow running and will keep it running for several days/week to see if I can reproduce. Also of note please use your same test flow but use HashContent instead of crypto hash. Curious if that matters for any reason...
> > > >>
> > > >> Still want to know more about your underlying storage system.
> > > >>
> > > >> You could also try updating nifi.properties and changing the following lines:
> > > >>
> > > >> nifi.flowfile.repository.always.sync=true
> > > >> nifi.content.repository.always.sync=true
> > > >> nifi.provenance.repository.always.sync=true
> > > >>
> > > >> It will hurt performance but can be useful/necessary on certain storage subsystems.
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Tue, Oct 26, 2021 at 12:05 PM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Ignore "For the scenario where you can replicate this please share the flow.xml.gz for which it is reproducible." I see the uploaded JSON
> > > >>
> > > >> On Tue, Oct 26, 2021 at 12:04 PM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> We asked about the underlying storage system. You replied with some info but not the specifics. Do you know precisely what the underlying storage is and how it is presented to the operating system? For instance is it NFS or something similar?
> > > >>
> > > >> I've set up a very similar flow at extremely high rates running for the past several days with no issue. In my case though I know precisely what the config is and the disk setup is. Didn't do anything special to be clear but still it is important to know.
> > > >>
> > > >> For the scenario where you can replicate this please share the flow.xml.gz for which it is reproducible.
> > > >>
> > > >> Thanks
> > > >> Joe
> > > >>
> > > >> On Sun, Oct 24, 2021 at 9:53 PM Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Dear Joe and Mark
> > > >>
> > > >> I have created a test flow without the SFTP processors, which doesn't create any errors. Therefore I created a new test flow where I use a MergeContent and UnpackContent instead of the SFTP processors. This keeps all data internal to NiFi, but forces NiFi to write and read new files entirely locally.
> > > >>
> > > >> My flow has been running for 7 days, and this morning there were 2 files where the sha256 gave a different hash value than the original. I have set this flow up in another NiFi cluster used only for testing, and the cluster is not doing anything else. It is using NiFi 1.14.0.
> > > >>
> > > >> So I can reproduce issues on different NiFi clusters and versions (1.13.2 and 1.14.0) where the calculation of a hash on content can give different outputs. It doesn't make any sense, but it happens. In all my cases the issue happens where the hash calculation runs right after NiFi writes the content to the content repository. I don't know if there could be some kind of delay in writing the content 100% before the next processor begins reading it?
> > > >>
> > > >> Please see the attached test flow, and the previous mail with a PDF showing the lineage of a production file which also had issues. In the PDF, check steps 5 and 12.
> > > >>
> > > >> Kind regards
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> On Thu, Oct 21, 2021 at 08:28, Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Joe,
> > > >>
> > > >> To start from the last mail :-)
> > > >>
> > > >> Each repository has its own disk, and I'm using ext4:
> > > >>
> > > >> /dev/VG_b/LV_b /nifiRepo ext4 defaults,noatime 0 0
> > > >> /dev/VG_c/LV_c /provRepo01 ext4 defaults,noatime 0 0
> > > >> /dev/VG_d/LV_d /contRepo01 ext4 defaults,noatime 0 0
> > > >>
> > > >> My test flow WITH sftp looks like this:
> > > >>
> > > >> <image.png>
> > > >>
> > > >> And this flow has produced 1 error within 3 days. After many, many loops the file failed and went out via the "unmatched" output to the disabled UpdateAttribute, which does nothing; it is just there to keep the failed flowfile in a queue. I enabled the UpdateAttribute and looped the file back to the CryptographicHashContent, and now it calculated the hash correctly again. But in this flow I have a FetchSFTP processor right before the hashing.
> > > >>
> > > >> Right now my flow is running without the 2 SFTP processors, and in the last 24 hours there have been no errors.
> > > >>
> > > >> About the lineage:
> > > >>
> > > >> Is there a way to export all the lineage data? The export only generates an SVG file.
> > > >>
> > > >> This is only for the receiving NiFi, which internally calculates 2 different hashes on the same content with about a 1-minute delay. Attached is a PDF document with the lineage, the flow and all the relevant provenance information for each step in the lineage.
> > > >>
> > > >> The interesting steps are steps 5 and 12.
> > > >>
> > > >> Can the issue be that data is not written 100% to disk between steps 4 and 5 in the flow?
> > > >>
> > > >> Kind regards
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> On Wed, Oct 20, 2021 at 23:49, Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> Also what type of file system/storage system are you running NiFi on
> > > >> in this case? We'll need to know this for the NiFi
> > > >> content/flowfile/provenance repositories. Is it NFS?
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Wed, Oct 20, 2021 at 11:14 AM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> And to further narrow this down
> > > >>
> > > >> "I have a test flow, where a GenerateFlowFile has created 6x 1GB files
> > > >> (2 files per node) and the next processor was a HashContent before it ran
> > > >> into a test loop, where files are uploaded via PutSFTP to a test
> > > >> server, downloaded again, and the hash recalculated. I have had one
> > > >> issue after 3 days of running."
> > > >>
> > > >> So to be clear with GenerateFlowFile making these files and then you
> > > >> looping the content is wholly and fully exclusively within the control
> > > >> of NiFi. No Get/Fetch/Put-SFTP of any kind at all. By looping the
> > > >> same files over and over in NiFi itself, can you make this happen or
> > > >> not?
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Wed, Oct 20, 2021 at 11:08 AM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> "After fetching a FlowFile-stream file and unpacking it back into NiFi
> > > >> I calculate a sha256. 1 minute later I recalculate the sha256 on the
> > > >> exact same file, and get a new hash. That is what worries me.
> > > >> The fact that the same file can be recalculated and produce two
> > > >> different hashes is very strange, but it happens."
> > > >>
> > > >> Ok so to confirm you are saying that in each case this happens you see
> > > >> it first compute the wrong hash, but then if you retry the same
> > > >> flowfile it then provides the correct hash?
> > > >>
> > > >> Can you please also show/share the lineage history for such a flow
> > > >> file then? It should have events for the initial hash, second hash,
> > > >> the unpacking, trace to the original stream, etc...
> > > >>
> > > >> Thanks
> > > >>
> > > >> On Wed, Oct 20, 2021 at 11:00 AM Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Dear Mark and Joe
> > > >>
> > > >> I know my setup isn’t normal for many people. But if we only look at my receive side, which the last mails are about, everything is happening on the same NiFi instance. It is the same 3-node NiFi cluster.
> > > >>
> > > >> After fetching a FlowFile-stream file and unpacking it back into NiFi I calculate a sha256. 1 minute later I recalculate the sha256 on the exact same file, and get a new hash. That is what worries me.
> > > >>
> > > >> The fact that the same file can be recalculated and produce two different hashes is very strange, but it happens. Over the last 5 months it has only happened 35-40 times.
> > > >>
> > > >> I could understand it if the file were not completely loaded and saved into the content repository before the hashing starts. But I believe that the unpack processor doesn’t forward the flow file to the next processor before it has 100% finished unpacking and saving the new content to the repository.
> > > >>
> > > >> I have a test flow, where a GenerateFlowFile has created 6x 1GB files (2 files per node) and the next processor was a HashContent before it ran into a test loop, where files are uploaded via PutSFTP to a test server, downloaded again, and the hash recalculated. I have had one issue after 3 days of running.
> > > >>
> > > >> Now the test flow is running without the Put/Fetch sftp processors.
> > > >>
> > > >> Another problem is that I can’t find any correlation to other events, neither within NIFI nor on the server itself or in VMware. If I could just find any other event which happens at the same time, I might be able to force some kind of event to trigger the issue.
> > > >>
> > > >> I have tried to force VMware to migrate a NiFi node to another host, and forced it to take a snapshot and delete snapshots, but nothing can trigger an error.
> > > >>
> > > >> I know it will be very, very difficult to reproduce. But I will set up multiple NiFi instances running different test flows to see if I can find any reason why it behaves as it does.
> > > >>
> > > >> Kind Regards
> > > >>
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> Den 20. okt. 2021 kl. 16.39 skrev Mark Payne <[email protected]>:
> > > >>
> > > >> Jens,
> > > >>
> > > >> Thanks for sharing the images.
> > > >>
> > > >> I tried to set up a test to reproduce the issue. I’ve had it running for quite some time, running through millions of iterations.
> > > >>
> > > >> I’ve used 5 KB files, 50 KB files, 50 MB files, and larger (to the tune of hundreds of MB). I’ve been unable to reproduce an issue after millions of iterations.
> > > >>
> > > >> So far I cannot replicate. And since you’re pulling the data via SFTP and then unpacking, which preserves all original attributes from a different system, this can easily become confusing.
> > > >>
> > > >> Recommend trying to reproduce with SFTP-related processors out of the picture, as Joe is mentioning. Either using GetFile/FetchFile or GenerateFlowFile. Then immediately use CryptographicHashContent to generate an ‘initial hash’, copy that value to another attribute, and then loop, generating the hash and comparing against the original one. 
I’ll attach a flow that does this, but not sure if the email server will strip out the attachment or not.
> > > >>
> > > >> This way we remove any possibility of actual corruption between the two nifi instances. If we can still see corruption / different hashes within a single nifi instance, then it certainly warrants further investigation, but I can’t see any issues so far.
> > > >>
> > > >> Thanks
> > > >>
> > > >> -Mark
> > > >>
> > > >>
> > > >> On Oct 20, 2021, at 10:21 AM, Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens
> > > >>
> > > >> Actually is this current loop test contained within a single nifi and there you see corruption happen?
> > > >>
> > > >> Joe
> > > >>
> > > >> On Wed, Oct 20, 2021 at 7:14 AM Joe Witt <[email protected]> wrote:
> > > >>
> > > >> Jens,
> > > >>
> > > >> You have a very involved setup including other systems (non NiFi). Have you removed those systems from the equation so you have more evidence to support your expectation that NiFi is doing something other than you expect?
> > > >>
> > > >> Joe
> > > >>
> > > >> On Wed, Oct 20, 2021 at 7:10 AM Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Hi
> > > >>
> > > >> Today I have another file which has been running through the retry loop one time. To test the processors and the algorithm I added the HashContent processor and also added hashing by SHA-1.
> > > >>
> > > >> A file has been going through the system, and both the SHA-1 and SHA-256 are different than expected. With a 1 minute delay the file goes back into the hashing content flow, and this time it calculates both hashes fine.
> > > >>
> > > >> I don't believe that the hashing is buggy, but something is very, very strange. What can influence the processors/algorithm to calculate a different hash??? 
> > > >>
> > > >> All the input/output claim information is exactly the same. It is the same flow/content file going in a loop. It happens on all 3 nodes.
> > > >>
> > > >> Any suggestions for where to dig?
> > > >>
> > > >> Regards
> > > >>
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> Den ons. 20. okt. 2021 kl. 06.34 skrev Jens M. Kofoed <[email protected]>:
> > > >>
> > > >> Hi Mark
> > > >>
> > > >> Thanks for replying and for the suggestion to look at the content claim.
> > > >>
> > > >> These 3 pictures are from the first attempt:
> > > >>
> > > >> <image.png> <image.png> <image.png>
> > > >>
> > > >> Yesterday I realized that the content was still in the archive, so I could replay the file.
> > > >>
> > > >> <image.png>
> > > >>
> > > >> So here are the same pictures but for the replay, and as you can see the Identifier, Offset and Size are all the same.
> > > >>
> > > >> <image.png> <image.png> <image.png>
> > > >>
> > > >> In my flow, if the hash does not match my original first calculated hash, the file goes into a retry loop. Here are the pictures for the 4th time the file went through:
> > > >>
> > > >> <image.png> <image.png> <image.png>
> > > >>
> > > >> Here the content claim is all the same.
> > > >>
> > > >> It is very rare that we see these issues (<1 : 1,000,000 files) and only with large files. Only once have I seen the error with a 110MB file; the other times the file sizes were above 800MB.
> > > >>
> > > >> This time it was a Nifi-Flowstream v3 file, which was exported from one system and imported into another. But once the file has been imported, it is the same file inside NIFI and it stays on the same node. It goes through the same loop of processors multiple times, and in the end the CryptographicHashContent calculates a different SHA256 than it did earlier. This should not be possible!!! And that is what concerns me the most. 
> > > >>
> > > >> What can influence the same processor to calculate 2 different sha256 on the exact same content???
> > > >>
> > > >> Regards
> > > >>
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> Den tir. 19. okt. 2021 kl. 16.51 skrev Mark Payne <[email protected]>:
> > > >>
> > > >> Jens,
> > > >>
> > > >> In the two provenance events - one showing a hash of dd4cc… and the other showing f6f0….
> > > >>
> > > >> If you go to the Content tab, do they both show the same Content Claim? I.e., do the Input Claim / Output Claim show the same values for Container, Section, Identifier, Offset, and Size?
> > > >>
> > > >> Thanks
> > > >>
> > > >> -Mark
> > > >>
> > > >> On Oct 19, 2021, at 1:22 AM, Jens M. Kofoed <[email protected]> wrote:
> > > >>
> > > >> Dear NIFI Users
> > > >>
> > > >> I have posted this mail in the developers mailing list and just want to inform all of you about a very odd behavior we are facing.
> > > >>
> > > >> The background:
> > > >>
> > > >> We have data going between 2 different NIFI systems which have no direct network access to each other. Therefore we calculate a SHA256 hash value of the content at system 1, before the flowfile and data are combined and saved as a "flowfile-stream-v3" pkg file. The file is then transported to system 2, where the pkg file is unpacked and the flow can continue. To be sure about file integrity we calculate a new sha256 at system 2. But sometimes we see that the sha256 gets another value, which might suggest the file was corrupted. But recalculating the sha256 again gives a new hash value.
> > > >>
> > > >> ----
> > > >>
> > > >> Tonight I had yet another file which didn't match the expected sha256 hash value. The content is a 1.7GB file and the Event Duration was "00:00:17.539" to calculate the hash. 
> > > >>
> > > >> I have created a retry loop, where the file goes to a Wait processor that delays it 1 minute before it goes back to the CryptographicHashContent for a new calculation. After 3 retries the file goes to retries_exceeded and on to a disabled processor, just to sit in a queue so I can look at it manually. This morning I rerouted the file from my retries_exceeded queue back to the CryptographicHashContent for a new calculation, and this time it calculated the correct hash value.
> > > >>
> > > >> THIS CAN'T BE TRUE :-( :-( But it is. - Something very very strange is happening.
> > > >>
> > > >> <image.png>
> > > >>
> > > >> We are running NiFi 1.13.2 in a 3 node cluster on Ubuntu 20.04.02 with openjdk version "1.8.0_292", OpenJDK Runtime Environment (build 1.8.0_292-8u292-b10-0ubuntu1~20.04-b10), OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode). Each server is a VM with 4 CPU, 8GB RAM on VMware ESXi, 7.0.2. Each NIFI node is running on a different physical VM host.
> > > >>
> > > >> I have inspected different logs to see if I can find any correlation with what happened at the same time as the file was going through my loop, but there is no event/task at that exact time. 
> > > >>
> > > >>
> > > >> System 1:
> > > >>
> > > >> At 10/19/2021 00:15:11.247 CEST my file is going through a CryptographicHashContent: SHA256 value: dd4cc7ef8dbc8d70528e8aa788581f0ab88d297c9c9f39b6b542df68952efd20
> > > >>
> > > >> The file is exported as a "FlowFile Stream, v3" to System 2
> > > >>
> > > >> System 2:
> > > >>
> > > >> At 10/19/2021 00:18:10.528 CEST the file is going through a CryptographicHashContent: SHA256 value: f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
> > > >>
> > > >> <image.png>
> > > >>
> > > >> At 10/19/2021 00:19:08.996 CEST the file is going through the same CryptographicHashContent at system 2: SHA256 value: f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
> > > >>
> > > >> At 10/19/2021 00:20:04.376 CEST the file is going through the same CryptographicHashContent at system 2: SHA256 value: f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
> > > >>
> > > >> At 10/19/2021 00:21:01.711 CEST the file is going through the same CryptographicHashContent at system 2: SHA256 value: f6f0909aacae4952f10f6fa7704f3e55d0481ec211d495993550aedbb3fe0819
> > > >>
> > > >> At 10/19/2021 06:07:43.376 CEST the file is going through the same CryptographicHashContent at system 2: SHA256 value: dd4cc7ef8dbc8d70528e8aa788581f0ab88d297c9c9f39b6b542df68952efd20
> > > >>
> > > >> <image.png>
> > > >>
> > > >> How on earth can this happen???
> > > >>
> > > >> Kind Regards
> > > >>
> > > >> Jens M. Kofoed
> > > >>
> > > >>
> > > >> <Repro.json>
> > > >>
> > > >> <Try_to_recreate_Jens_Challenge.json>
> > > >>
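The loop test Mark describes in this thread (compute an initial hash, keep that value, then re-hash the same content repeatedly and compare) can also be approximated outside NiFi, which may help separate NiFi itself from the JVM, VM, and storage layers. What follows is only a minimal sketch in Python, not the attached NiFi flow: the function names `sha256_of_file` and `rehash_loop`, the chunk size, and the temporary test file are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Stream the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rehash_loop(path, iterations):
    """Hash once, then re-hash `iterations` times.

    Returns the initial hash and a list of (iteration, hash) pairs for
    every iteration whose hash differed from the initial one.
    """
    initial = sha256_of_file(path)
    mismatches = []
    for i in range(1, iterations + 1):
        current = sha256_of_file(path)
        if current != initial:
            mismatches.append((i, current))
    return initial, mismatches


if __name__ == "__main__":
    # Hypothetical test data: a small random file stands in for the
    # 1 GB files used in the thread.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
        test_path = tmp.name
    try:
        initial, mismatches = rehash_loop(test_path, iterations=5)
        print("initial hash:", initial)
        print("mismatches:", mismatches)
    finally:
        os.remove(test_path)
```

Note that repeated reads of an unchanged file are typically served from the OS page cache, so a clean run of this sketch mainly exercises memory and CPU rather than the disk and should be treated as weak evidence only.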
