Ben, are you familiar with the record readers, writers, and associated processors?
I suspect if you make a record writer for your custom format at the end of the flow chain you'll get great performance and control. A few sketches follow below your quoted message. Thanks.

On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <[email protected]> wrote:

> All, I'm seeking some advice on best practices for dealing with FlowFiles that contain a large volume of JSON records.
>
> My flow works like this:
>
> Receive a FlowFile with millions of JSON records in it.
>
> Potentially filter out some of the records based on the values of the JSON fields. (A custom processor uses a regex and a JSON path to produce a "matched" and a "not matched" output path.)
>
> Potentially split the FlowFile into multiple FlowFiles based on the value of one of the JSON fields. (A custom processor uses a JSON path and groups records into output FlowFiles based on the value.)
>
> Potentially split the FlowFile into uniformly sized smaller chunks to keep the file size from choking downstream systems. (We use SplitText when the data is newline delimited; we don't currently have a way to do this when the data is a JSON array of records.)
>
> Strip out some of the JSON fields (using JoltTransformJSON).
>
> At the end, wrap each JSON record in a proprietary format (a custom processor wraps each record).
>
> This flow is roughly similar across several unrelated data sets.
>
> The input files are sometimes provided as a single JSON array and sometimes as newline-delimited JSON records. In general, we've found newline-delimited JSON far easier to work with because we can process records one at a time without loading the entire FlowFile into memory (which we have to do for the array variant).
>
> However, if we use JoltTransformJSON to strip out or modify some of the JSON contents, it appears to operate only on an array (which is problematic from a memory-footprint standpoint).
>
> We don't really want to break our FlowFiles up into individual JSON records, as the number of FlowFiles the system would have to handle would be orders of magnitude larger than it is now.
>
> Is our approach of moving toward newline-delimited JSON a good one? If so, is there anything you'd recommend for replacing JoltTransformJSON? Should we build a custom processor? Or is this a reasonable feature request for the JoltTransformJSON processor to support newline-delimited JSON?
>
> Or should we be looking into ways to do lazy loading of the JSON arrays in our custom processors? (I have no clue how easy or hard that would be; my little bit of googling suggests it would be difficult.)
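On the lazy-loading question at the end: it's less difficult than the googling suggests. Jackson's streaming API can walk a top-level JSON array one element at a time, so only the current record is ever materialized. A minimal sketch (the class and method names are mine, purely illustrative); inside a processor you'd typically run this within session.read():

    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.function.Consumer;

    public class JsonArrayStreamer {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks a top-level JSON array and hands one record at a time to the
        // handler, so the whole array is never held in memory.
        public static void forEachRecord(InputStream in, Consumer<JsonNode> handler)
                throws IOException {
            try (JsonParser parser = MAPPER.getFactory().createParser(in)) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IOException("expected a top-level JSON array");
                }
                // nextToken() lands on the start of each element;
                // readValueAsTree() consumes exactly that element.
                while (parser.nextToken() != JsonToken.END_ARRAY) {
                    JsonNode record = parser.readValueAsTree();
                    handler.accept(record);
                }
            }
        }
    }

The same loop shape works for newline-delimited input if you read line by line instead, which is a big part of why NDJSON is so much easier to stream.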
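Once you're iterating records one at a time, the per-record filter check stays tiny. A sketch using Jayway JsonPath (an assumption on my part that this is the JSON path library in play; names are illustrative):

    import com.jayway.jsonpath.JsonPath;
    import com.jayway.jsonpath.PathNotFoundException;

    import java.util.regex.Pattern;

    public class RecordMatcher {

        private final JsonPath path;
        private final Pattern pattern;

        public RecordMatcher(String jsonPath, String regex) {
            this.path = JsonPath.compile(jsonPath);
            this.pattern = Pattern.compile(regex);
        }

        // Routes a single record: true -> "matched", false -> "not matched".
        public boolean matches(String recordJson) {
            try {
                Object value = path.read(recordJson);
                return value != null && pattern.matcher(String.valueOf(value)).matches();
            } catch (PathNotFoundException e) {
                return false;
            }
        }
    }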

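And on replacing JoltTransformJSON: the Jolt library itself doesn't require an array. You can compile the same chained spec the processor takes and apply it to one record at a time in a custom processor. A sketch ("fieldToStrip" is a placeholder for whatever you remove today):

    import com.bazaarvoice.jolt.Chainr;
    import com.bazaarvoice.jolt.JsonUtils;

    public class PerRecordJolt {

        // Same chained-spec format JoltTransformJSON accepts.
        private static final String SPEC =
            "[ { \"operation\": \"remove\", \"spec\": { \"fieldToStrip\": \"\" } } ]";

        private final Chainr chainr = Chainr.fromSpec(JsonUtils.jsonToList(SPEC));

        // Input and output are the Map/List structures Jolt operates on
        // (e.g. from JsonUtils.jsonToObject), one record at a time.
        public Object transform(Object record) {
            return chainr.transform(record);
        }
    }

Also worth a look, if your NiFi version has it: JoltTransformRecord, which pairs a Jolt spec with the record readers/writers I mentioned above, so you may not need custom code for this step at all.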