It would buy time, but either way it becomes a magic value people have
to know about.  This is not unlike the SplitText scenario, where we
recommend doing two-phase splits.  The problem is that for the
ProcessSession we hold information about the flowfiles (not their
content) and the provenance events in memory.  When we're talking
hundreds of thousands of events or more in a session, that adds up
really quickly.  Users should not need to know or worry about this
sort of thing.  We need a way to prestage these things to the
respective repositories (provenance/flowfile) so this can go back to
where it belongs, as a framework concern.  Easier said than done, but
a good goal for us to have.
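
To make the two-phase split idea concrete, here is a rough sketch in
plain Python.  It is not NiFi code: the function names and sizes are
made up for illustration, and "peak" simply counts how many outputs a
single split step produces, as a stand-in for the per-session
bookkeeping described above.

```python
from typing import Iterator, List, Tuple


def split_in_chunks(lines: List[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most chunk_size lines."""
    for start in range(0, len(lines), chunk_size):
        yield lines[start:start + chunk_size]


def two_phase_split(lines: List[str], coarse: int, fine: int = 1) -> Tuple[List[List[str]], int]:
    """Split coarsely first, then split each coarse chunk finely.

    Returns (pieces, peak), where peak is the largest number of outputs
    produced by any single split step (hypothetical proxy for how much
    one session would have to track at once).
    """
    phase_one = list(split_in_chunks(lines, coarse))
    peak = len(phase_one)
    pieces: List[List[str]] = []
    for chunk in phase_one:
        # Each coarse chunk is handled as its own, much smaller, step.
        out = list(split_in_chunks(chunk, fine))
        peak = max(peak, len(out))
        pieces.extend(out)
    return pieces, peak


if __name__ == "__main__":
    rows = [str(i) for i in range(100_000)]
    pieces, peak = two_phase_split(rows, coarse=1_000)
    print(len(pieces), peak)
```

A single-phase split of the same input would produce all 100,000
outputs in one step; with a coarse first phase, no step produces more
than the coarse chunk size.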

Peter's use case is a good one to rally around, as the way he wanted
it to work is reasonable and intuitive, and we should try to make that
happen.

Thanks
Joe

On Tue, Sep 20, 2016 at 5:29 PM, Andy LoPresto <alopre...@apache.org> wrote:
> Bryan,
>
> That’s a good point. Would running with a larger Java heap and higher swap
> threshold allow Peter to get larger batches out?
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Sep 20, 2016, at 1:41 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
> Peter,
>
> Does 10k happen to be your swap threshold in nifi.properties by any chance
> (it defaults to 20k I believe)?
>
> I suspect the behavior you are seeing could be due to the way swapping
> works, but Mark or others could probably confirm.
>
> I found this thread where Mark explained how swapping works with a
> background thread, and I believe it still works this way:
> http://apache-nifi.1125220.n5.nabble.com/Nifi-amp-Spark-receiver-performance-configuration-td524.html
>
> -Bryan
>
> On Tue, Sep 20, 2016 at 10:22 AM, Peter Wicks (pwicks) <pwi...@micron.com>
> wrote:
>>
>> I’m using JSONToSQL, followed by PutSQL.  I’m using Teradata, which
>> supports a special JDBC mode called FastLoad, designed for a minimum of
>> 100,000 rows of data per batch.
>>
>>
>>
>> What I’m finding is that when PutSQL requests a new batch of FlowFiles
>> from the queue, which has over 1 million rows in it, with a batch size of
>> 1000000, it always returns a maximum of 10k.  How can I get my obscenely
>> sized batch request to return all the FlowFiles I’m asking for?
>>
>>
>>
>> Thanks,
>>
>>   Peter
>
>
>
