Ben, are you familiar with the record readers, writers, and associated processors?
I suspect if you make a record writer for your custom format at the end of the flow chain you'll get great performance and control. A few sketches follow below your quoted message. Thanks.

On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <[email protected]> wrote:

> All, I'm seeking some advice on best practices for dealing with FlowFiles that contain a large volume of JSON records.
>
> My flow works like this:
>
> Receive a FlowFile with millions of JSON records in it.
>
> Potentially filter out some of the records based on the values of the JSON fields. (A custom processor uses a regex and a JSON path to produce a "matched" and a "not matched" output path.)
>
> Potentially split the FlowFile into multiple FlowFiles based on the value of one of the JSON fields. (A custom processor uses a JSON path and groups records into output FlowFiles based on the value.)
>
> Potentially split the FlowFile into uniformly sized smaller chunks to keep the file size from choking downstream systems. (We use SplitText when the data is newline delimited; we don't currently have a way to do this when the data is a JSON array of records.)
>
> Strip out some of the JSON fields (using JoltTransformJSON).
>
> At the end, wrap each JSON record in a proprietary format (a custom processor wraps each record).
>
> This flow is roughly similar across several unrelated data sets.
>
> The input files are sometimes provided as a single JSON array and sometimes as newline-delimited JSON records. In general, we've found newline-delimited JSON far easier to work with because we can process records one at a time without loading the entire FlowFile into memory (which we have to do for the array variant).
>
> However, if we use JoltTransformJSON to strip out or modify some of the JSON contents, it appears to operate only on an array (which is problematic from a memory-footprint standpoint).
>
> We don't really want to break our FlowFiles up into individual JSON records, as the number of FlowFiles the system would have to handle would be orders of magnitude larger than it is now.
>
> Is our approach of moving toward newline-delimited JSON a good one? If so, is there anything you'd recommend for replacing JoltTransformJSON? Should we build a custom processor? Or is this a reasonable feature request for the JoltTransformJSON processor to support newline-delimited JSON?
>
> Or should we be looking into ways to do lazy loading of the JSON arrays in our custom processors? (I have no clue how easy or hard that would be; my little bit of googling suggests it would be difficult.)
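On the lazy-loading question at the end: it's less difficult than the googling suggests. Jackson's streaming API can walk a top-level JSON array one element at a time, so only the current record is ever materialized. A minimal sketch (the class and method names are mine, purely illustrative); inside a processor you'd typically run this within session.read():

    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.function.Consumer;

    public class JsonArrayStreamer {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks a top-level JSON array and hands one record at a time to the
        // handler, so the whole array is never held in memory.
        public static void forEachRecord(InputStream in, Consumer<JsonNode> handler)
                throws IOException {
            try (JsonParser parser = MAPPER.getFactory().createParser(in)) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IOException("expected a top-level JSON array");
                }
                // nextToken() lands on the start of each element;
                // readValueAsTree() consumes exactly that element.
                while (parser.nextToken() != JsonToken.END_ARRAY) {
                    JsonNode record = parser.readValueAsTree();
                    handler.accept(record);
                }
            }
        }
    }

The same loop shape works for newline-delimited input if you read line by line instead, which is a big part of why NDJSON is so much easier to stream.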
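Once you're iterating records one at a time, the per-record filter check stays tiny. A sketch using Jayway JsonPath (an assumption on my part that this is the JSON path library in play; names are illustrative):

    import com.jayway.jsonpath.JsonPath;
    import com.jayway.jsonpath.PathNotFoundException;

    import java.util.regex.Pattern;

    public class RecordMatcher {

        private final JsonPath path;
        private final Pattern pattern;

        public RecordMatcher(String jsonPath, String regex) {
            this.path = JsonPath.compile(jsonPath);
            this.pattern = Pattern.compile(regex);
        }

        // Routes a single record: true -> "matched", false -> "not matched".
        public boolean matches(String recordJson) {
            try {
                Object value = path.read(recordJson);
                return value != null && pattern.matcher(String.valueOf(value)).matches();
            } catch (PathNotFoundException e) {
                return false;
            }
        }
    }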

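And on replacing JoltTransformJSON: the Jolt library itself doesn't require an array. You can compile the same chained spec the processor takes and apply it to one record at a time in a custom processor. A sketch ("fieldToStrip" is a placeholder for whatever you remove today):

    import com.bazaarvoice.jolt.Chainr;
    import com.bazaarvoice.jolt.JsonUtils;

    public class PerRecordJolt {

        // Same chained-spec format JoltTransformJSON accepts.
        private static final String SPEC =
            "[ { \"operation\": \"remove\", \"spec\": { \"fieldToStrip\": \"\" } } ]";

        private final Chainr chainr = Chainr.fromSpec(JsonUtils.jsonToList(SPEC));

        // Input and output are the Map/List structures Jolt operates on
        // (e.g. from JsonUtils.jsonToObject), one record at a time.
        public Object transform(Object record) {
            return chainr.transform(record);
        }
    }

Also worth a look, if your NiFi version has it: JoltTransformRecord, which pairs a Jolt spec with the record readers/writers I mentioned above, so you may not need custom code for this step at all.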