I am not. I continued googling for a bit after sending my email and stumbled upon a slide deck by Bryan Bende. My initial concern from looking at it is that it seems to require schema knowledge.
For most of our data sets, we operate in a space where we have a handful of guaranteed fields and no way of knowing what other fields the upstream provider is going to send us. We want to operate on the data in a manner that is non-destructive to unanticipated fields. Is that a blocker for using the RecordReader stuff?

On Fri, Aug 10, 2018 at 4:30 PM Joe Witt <[email protected]> wrote:

> ben
>
> are you familiar with the record readers, writers, and associated
> processors?
>
> i suspect if you make a record writer for your custom format at the end of
> the flow chain you'll get great performance and control.
>
> thanks
>
> On Fri, Aug 10, 2018, 4:27 PM Benjamin Janssen <[email protected]>
> wrote:
>
>> All, I'm seeking some advice on best practices for dealing with FlowFiles
>> that contain a large volume of JSON records.
>>
>> My flow works like this:
>>
>> Receive a FlowFile with millions of JSON records in it.
>>
>> Potentially filter out some of the records based on the values of the JSON
>> fields (a custom processor uses a regex and a JSON path to produce
>> "matched" and "not matched" output paths).
>>
>> Potentially split the FlowFile into multiple FlowFiles based on the value
>> of one of the JSON fields (a custom processor uses a JSON path and groups
>> records into output FlowFiles based on the value).
>>
>> Potentially split the FlowFile into uniformly sized smaller chunks to keep
>> downstream systems from choking on the file size (we use SplitText when
>> the data is newline delimited; we don't currently have a way to do this
>> when the data is a JSON array of records).
>>
>> Strip out some of the JSON fields (using JoltTransformJSON).
>>
>> At the end, wrap each JSON record in a proprietary format (a custom
>> processor wraps each JSON record).
>>
>> This flow is roughly similar across several unrelated data sets.
>>
>> The input data files are occasionally provided as a single JSON array and
>> occasionally as newline-delimited JSON records. In general, we've found
>> newline-delimited JSON records far easier to work with because we can
>> process them one at a time without loading the entire FlowFile into memory
>> (which we have to do for the array variant).
>>
>> However, if we are to use JoltTransformJSON to strip out or modify some
>> of the JSON contents, it appears to operate only on an array (which is
>> problematic from a memory-footprint standpoint).
>>
>> We don't really want to break our FlowFiles up into individual JSON
>> records, as the number of FlowFiles the system would have to handle would
>> be orders of magnitude larger than it is now.
>>
>> Is our approach of moving toward newline-delimited JSON a good one? If
>> so, is there anything that would be recommended for replacing
>> JoltTransformJSON? Or should we build a custom processor? Or is this a
>> reasonable feature request for the JoltTransformJSON processor to support
>> newline-delimited JSON?
>>
>> Or should we be looking into ways to do lazy loading of the JSON arrays
>> in our custom processors (I have no clue how easy or hard that would be)?
>> My little bit of googling suggests it would be difficult.
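
P.S. On the lazy-loading question at the bottom of my original note: the closest thing I've found so far is Jackson's streaming parser, which can walk a top-level JSON array one record at a time without materializing the whole file. This is just an untested sketch outside of NiFi's record API, assuming Jackson is on the classpath and the records are plain objects:

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.JsonToken;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.IOException;
    import java.io.InputStream;

    public class StreamingArrayReader {

        private static final ObjectMapper MAPPER = new ObjectMapper();

        // Walks a top-level JSON array and hands each element to the handler
        // one at a time, so only a single record is ever held in memory.
        static void forEachRecord(InputStream in, RecordHandler handler) throws IOException {
            JsonFactory factory = MAPPER.getFactory();
            try (JsonParser parser = factory.createParser(in)) {
                if (parser.nextToken() != JsonToken.START_ARRAY) {
                    throw new IOException("Expected a top-level JSON array");
                }
                // Each START_OBJECT token is one record; readTree() consumes exactly that object.
                while (parser.nextToken() == JsonToken.START_OBJECT) {
                    JsonNode record = MAPPER.readTree(parser);
                    handler.handle(record);
                }
            }
        }

        interface RecordHandler {
            void handle(JsonNode record) throws IOException;
        }
    }

The newline-delimited case would be even simpler: a BufferedReader plus MAPPER.readTree(line) per line.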
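
For anyone following along, the field stripping we're talking about in JoltTransformJSON is essentially just a remove spec along these lines (field names invented for illustration); the issue is only that it runs against the whole array at once:

    [
      {
        "operation": "remove",
        "spec": {
          "internalId": "",
          "debugBlob": ""
        }
      }
    ]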
