I am not. I continued googling for a bit after sending my email and stumbled upon a slide deck by Bryan Bende. My initial concern from looking at it is that it seems to require schema knowledge.
For most of our data sets, we operate in a space where we have a handful of guaranteed fields and no way of knowing what other fields the upstream provider is going to send us. We want to operate on the data in a manner that is non-destructive to unanticipated fields. Is that a blocker for using the RecordReader stuff?

On Fri, Aug 10, 2018 at 4:30 PM Joe Witt <[email protected]> wrote:

> ben
>
> are you familiar with the record readers, writers, and associated
> processors?
>
> i suspect if you make a record writer for your custom format at the end of
> the flow chain you'll get great performance and control.
>
> thanks
>
> On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <[email protected]>
> wrote:
>
>> All, I'm seeking some advice on best practices for dealing with FlowFiles
>> that contain a large volume of JSON records.
>>
>> My flow works like this:
>>
>> Receive a FlowFile with millions of JSON records in it.
>>
>> Potentially filter out some of the records based on the values of the JSON
>> fields (a custom processor uses a regex and a JSON path to produce
>> "matched" and "not matched" output paths).
>>
>> Potentially split the FlowFile into multiple FlowFiles based on the value
>> of one of the JSON fields (a custom processor uses a JSON path and groups
>> records into output FlowFiles based on the value).
>>
>> Potentially split the FlowFile into uniformly sized smaller chunks to keep
>> downstream systems from choking on the file size (we use SplitText when
>> the data is newline delimited; we don't currently have a way to do this
>> when the data is a JSON array of records).
>>
>> Strip out some of the JSON fields (using JoltTransformJSON).
>>
>> At the end, wrap each JSON record in a proprietary format (a custom
>> processor wraps each JSON record).
>>
>> This flow is roughly similar across several unrelated data sets.
>>
>> The input data files are occasionally provided as a single JSON array and
>> occasionally as newline-delimited JSON records. In general, we've found
>> newline-delimited JSON records far easier to work with because we can
>> process them one at a time without loading the entire FlowFile into memory
>> (which we have to do for the array variant).
>>
>> However, if we are to use JoltTransformJSON to strip out or modify some
>> of the JSON contents, it appears to operate only on an array (which is
>> problematic from a memory-footprint standpoint).
>>
>> We don't really want to break our FlowFiles up into individual JSON
>> records, as the number of FlowFiles the system would have to handle would
>> be orders of magnitude larger than it is now.
>>
>> Is our approach of moving toward newline-delimited JSON a good one? If
>> so, is there anything that would be recommended for replacing
>> JoltTransformJSON? Or should we build a custom processor? Or is this a
>> reasonable feature request for the JoltTransformJSON processor to support
>> newline-delimited JSON?
>>
>> Or should we be looking into ways to do lazy loading of the JSON arrays
>> in our custom processors (I have no clue how easy or hard that would be)?
>> My little bit of googling suggests it would be difficult.
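
P.S. On the lazy-loading question at the bottom of my original note: the closest thing I've found so far is Jackson's streaming parser, which can walk a top-level JSON array one record at a time without materializing the whole file. This is just an untested sketch outside of NiFi's record API, assuming Jackson is on the classpath and the records are plain objects:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;
    import java.io.InputStream;

    public class StreamingArrayReader {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks a top-level JSON array and hands each element to the handler
        // one at a time, so only a single record is ever held in memory.
        static void forEachRecord(InputStream in, RecordHandler handler) throws IOException {
            JsonFactory factory = MAPPER.getFactory();
            try (JsonParser parser = factory.createParser(in)) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IOException("Expected a top-level JSON array");
                }
                // Each START_OBJECT token is one record; readTree() consumes exactly that object.
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    JsonNode record = MAPPER.readTree(parser);
                    handler.handle(record);
                }
            }
        }

        interface RecordHandler {
            void handle(JsonNode record) throws IOException;
        }
    }

The newline-delimited case would be even simpler: a BufferedReader plus MAPPER.readTree(line) per line.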
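
For anyone following along, the field stripping we're talking about in JoltTransformJSON is essentially just a remove spec along these lines (field names invented for illustration); the issue is only that it runs against the whole array at once:

    [
      {
        "operation": "remove",
        "spec": {
          "internalId": "",
          "debugBlob": ""
        }
      }
    ]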
