Hello,

I've been experimenting with NiFi and MongoDB. I have a test collection with 1 
million documents in it. Each document has the same flat JSON structure with 11 
fields.
My NiFi flow exposes a web service that lets the user fetch all the data in
CSV format.

However, fetching 1M documents brings NiFi to its knees. Even after increasing
the JVM's Xms and Xmx to 2G, I still get an OutOfMemoryError:

2018-06-20 11:27:43,428 WARN [Timer-Driven Process Thread-7] o.a.n.controller.tasks.ConnectableTask java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3332)
        at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
        at java.lang.StringBuilder.append(StringBuilder.java:136)
        at org.apache.nifi.processors.mongodb.GetMongo.buildBatch(GetMongo.java:222)
        at org.apache.nifi.processors.mongodb.GetMongo.onTrigger(GetMongo.java:341)
        at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
        at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1147)
        at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:175)
        at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenScheduling
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThr
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPool
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

I dug into the code and discovered that the GetMongo processor takes all the
documents returned from Mongo, converts them to Strings, and concatenates them
in a single StringBuilder.
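To make the pattern concrete, here's a standalone toy sketch (fakeCursor and both method names are mine, not NiFi code) contrasting the buffer-everything approach I saw in buildBatch with writing each document straight to an OutputStream as it arrives:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.stream.IntStream;

public class StreamVsBuffer {
    // Stand-in for a Mongo cursor: lazily yields n small JSON documents.
    static Iterator<String> fakeCursor(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> "{\"_id\":" + i + ",\"value\":\"row-" + i + "\"}")
                .iterator();
    }

    // The pattern I saw in GetMongo: accumulate every document in one
    // StringBuilder, so peak heap grows with the full result set.
    static String buildBatch(Iterator<String> cursor) {
        StringBuilder sb = new StringBuilder();
        while (cursor.hasNext()) {
            sb.append(cursor.next()).append('\n');
        }
        return sb.toString();
    }

    // Streaming alternative: write each document as it is read, so only
    // one document is ever held in memory at a time.
    static void streamBatch(Iterator<String> cursor, OutputStream out) throws IOException {
        while (cursor.hasNext()) {
            out.write(cursor.next().getBytes(StandardCharsets.UTF_8));
            out.write('\n');
        }
    }

    public static void main(String[] args) throws IOException {
        String buffered = buildBatch(fakeCursor(1000));

        ByteArrayOutputStream streamed = new ByteArrayOutputStream();
        streamBatch(fakeCursor(1000), streamed);

        // Both produce identical bytes; only the peak memory profile differs.
        System.out.println(buffered.equals(streamed.toString("UTF-8"))); // prints "true"
    }
}
```

Since NiFi writes flow file content through an OutputStream anyway, the streaming shape seems like it should fit, but I may be missing a reason the processor buffers.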

My question is thus: is there a better way to do this?
The only idea I've had is to use a smaller batch size, but that would mean I'd
need a later processor to concatenate the batches back into one big CSV.
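If I go the smaller-batch route, my understanding is that the buffered memory would at least be bounded by the batch size rather than the full result set. A toy sketch of what I mean (again, fakeCursor and toBatches are made-up names, not NiFi code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

public class BatchedFetch {
    // Stand-in for a Mongo cursor: lazily yields n small JSON documents.
    static Iterator<String> fakeCursor(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> "{\"_id\":" + i + ",\"value\":\"row-" + i + "\"}")
                .iterator();
    }

    // Drain the cursor in chunks of batchSize. Each returned string would
    // become one flow file, so the StringBuilder never holds more than
    // batchSize documents at once.
    static List<String> toBatches(Iterator<String> cursor, int batchSize) {
        List<String> batches = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        int count = 0;
        while (cursor.hasNext()) {
            sb.append(cursor.next()).append('\n');
            if (++count == batchSize) {
                batches.add(sb.toString()); // emit one flow file's worth
                sb.setLength(0);
                count = 0;
            }
        }
        if (sb.length() > 0) {
            batches.add(sb.toString()); // final partial batch
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> batches = toBatches(fakeCursor(10), 3);
        System.out.println(batches.size()); // prints "4" (3 + 3 + 3 + 1)
    }
}
```

The downside is exactly what I described: something downstream still has to stitch the batches back together, in order, to produce the single CSV the web service returns.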
Is there some sort of "GetMongoRecord" processor that reads each Mongo document
as a record, the way ExecuteSQL does? (I ran the same test against an SQL
database, and it handles 1M records just fine.)

Thanks for your help,

Kelsey
