MergeContent - Question on interplay of max bins, duration, and size with numerous correlation variable values

Mark Petronic Thu, 05 Nov 2015 21:06:43 -0800

I was expecting that, if I set the min bin size to 128 MB, the max to 512 MB,
the bin duration to 60 s, and max bins to 100, and if data was flowing quickly
enough that I received more than 512 MB in 60 s (all flow files are
keyed to the same correlation variable in this case), I would see
output flow files of around the max of 512 MB. But that is not what I see.
I played around with changing the max bins and duration but still don't
seem to be able to "force" large files. Instead I see files of around 100-150
MB. Can someone point me to a more detailed description of how the binning
logic works? Would like to understand the interplay between the number of
bins, duration, and size when you have sets of flow files coming in that
are linked to different correlation variables. In my case, if I process all
my file types, I have about 19 different classes of data so there are 19
different values for the correlation variable I use, "StatClass". Why would
one want many or few max bins? Does a larger duration put
more memory pressure on the JVM, or are the bins accumulated as files on
disk rather than in memory? I am trying to produce large files for HDFS
storage from a stream of many smaller files.
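
One way to reason about the settings above is a small toy model. The sketch below is not NiFi's implementation. It assumes, purely for illustration, that a bin is released as soon as its size reaches the minimum, that a bin older than the duration is released regardless of size, and that a flow file that would push a bin past the maximum closes that bin first. The function name `simulate_bins`, the 10 MB file size, and the arrival rate are all hypothetical, and max bins is left out of the model.

```python
MB = 1024 * 1024


def simulate_bins(arrivals, min_size, max_size, max_age):
    """Toy binning model (an assumption, not NiFi's actual logic).

    arrivals: time-ordered iterable of (time_sec, correlation_value, size_bytes).
    Returns a list of (correlation_value, merged_size_bytes) in release order.
    """
    bins = {}    # correlation value -> [bin_start_time, total_size]
    merged = []  # released bins
    for t, key, size in arrivals:
        # Assumed rule 1: release any bin that has reached the max age.
        expired = [k for k, (start, _) in bins.items() if t - start >= max_age]
        for k in expired:
            merged.append((k, bins.pop(k)[1]))
        # Assumed rule 2: never let a bin grow past the max size.
        if key in bins and bins[key][1] + size > max_size:
            merged.append((key, bins.pop(key)[1]))
        if key not in bins:
            bins[key] = [t, 0]
        bins[key][1] += size
        # Assumed rule 3: release a bin as soon as it reaches the min size.
        if bins[key][1] >= min_size:
            merged.append((key, bins.pop(key)[1]))
    return merged


# Hypothetical stream: 100 files of 10 MB, one every 0.5 s, one StatClass value.
arrivals = [(i * 0.5, "A", 10 * MB) for i in range(100)]
result = simulate_bins(arrivals, 128 * MB, 512 * MB, 60)
print([size // MB for _, size in result])
```

Under these assumed rules every merged file lands just above the minimum size, never near the maximum. That would be consistent with the 100-150 MB files described above, but whether the processor actually follows such a rule is exactly the open question.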


Thanks
