If you have large FlowFiles and are trying to sample records from each, you can use SampleRecord. It has Interval Sampling, Probabilistic Sampling, and Reservoir Sampling strategies, and I have a PR [1] up to add Range Sampling [2].
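
For intuition about the "random sample of N records" case, here is a minimal Python sketch of what reservoir sampling does conceptually. It is not SampleRecord's actual code; the function name `reservoir_sample` and the `seed` parameter are illustrative assumptions. It uses the classic single-pass approach (often called Algorithm R), which never needs to know the total record count in advance:

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Return up to n records chosen uniformly at random in one pass.

    Hypothetical helper for illustration only; not part of NiFi.
    """
    rng = random.Random(seed)  # seeded only to make runs repeatable
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the record is kept only if the slot
            # falls inside the reservoir, i.e. with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 1032, seed=42)
    print(len(sample))
```

Because only the reservoir is held in memory, the same idea works on very large inputs. For sampling within date boundaries, one option would be to split the data by boundary first and then sample each partition separately.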

Regards,
Matt

[1] https://github.com/apache/nifi/pull/5878
[2] https://issues.apache.org/jira/browse/NIFI-9814

On Thu, May 19, 2022 at 6:20 AM James McMahon <jsmcmah...@gmail.com> wrote:
>
> I have been tasked to draw samples from very large raw data sets for triage
> analysis. I am to provide multiple sampling methods. Drawing a random sample
> of N records is one method. A second method is to draw a fixed sample of
> 1,032 records from stratified defined date boundaries in a set. The latter is
> of interest because raw data can substantially change structure or even
> format at points in time, and we need to be able to sample within those data
> boundaries.
>
> Can anyone offer a link to an example of how nifi may be used to draw samples
> randomly and/or in a systematic way from raw data collections?