Hello,
I've been experimenting with NiFi and MongoDB. I have a test collection with 1
million documents in it. Each document has the same flat JSON structure with 11
fields.
My NiFi flow exposes a web service that lets users fetch all the data in CSV
format.
However, 1M documents brings NiFi to its knees. Even after increasing the JVM's
Xms and Xmx to 2G, I still get an OutOfMemoryError:
2018-06-20 11:27:43,428 WARN [Timer-Driven Process Thread-7] o.a.n.controller.tasks.ConnectableTask Administratively Yielding ... due to uncaught Exception: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at org.apache.nifi.processors.mongodb.GetMongo.buildBatch(GetMongo.java:222)
    at org.apache.nifi.processors.mongodb.GetMongo.onTrigger(GetMongo.java:341)
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1147)
    at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:175)
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenScheduling
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThr
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPool
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
I dug into the code, and discovered that the GetMongo processor takes all the
Documents returned from Mongo, converts them to Strings, and concatenates them
in a StringBuilder.
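To make the problem concrete, here is a minimal plain-Java sketch of the two approaches, using an Iterator<String> as a stand-in for the Mongo cursor (class and method names are mine, purely illustrative, not NiFi code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class StreamingSketch {

    // Buffering, roughly what GetMongo's buildBatch does: every document
    // is appended to a single StringBuilder, so the entire result set
    // must fit in the heap at once.
    static String buildBatch(Iterator<String> docs) {
        StringBuilder sb = new StringBuilder();
        while (docs.hasNext()) {
            sb.append(docs.next()).append('\n');
        }
        return sb.toString();
    }

    // Streaming: each document is written to the output as it arrives,
    // so memory use is bounded by one document, not the result set.
    static void streamBatch(Iterator<String> docs, OutputStream out) {
        try {
            while (docs.hasNext()) {
                out.write(docs.next().getBytes(StandardCharsets.UTF_8));
                out.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("{\"a\":1}", "{\"a\":2}");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        streamBatch(docs.iterator(), out);
        System.out.print(out.toString(StandardCharsets.UTF_8));
    }
}
```

The streaming shape is what I'd expect from a processor that writes each document straight into the FlowFile's OutputStream instead of building one giant String first.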
My question, then, is: is there a better way to do this?
The only idea I've had is to use a smaller batch size, but that would just mean
I'd need a later processor to concatenate the batches in order to produce one
big CSV.
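For what it's worth, that downstream concatenation step would essentially have to do something like the following (a hypothetical sketch of my own, assuming each batch arrives as a CSV string carrying its own header row, not actual NiFi code):

```java
import java.util.List;

public class CsvBatchMerge {

    // Merge CSV batches that each carry their own header row:
    // keep the header from the first batch and strip it from the rest.
    // Assumes each batch's header line ends with '\n'.
    static String mergeCsvBatches(List<String> batches) {
        StringBuilder merged = new StringBuilder();
        boolean first = true;
        for (String batch : batches) {
            if (first) {
                merged.append(batch);
                first = false;
            } else {
                int headerEnd = batch.indexOf('\n');
                merged.append(batch, headerEnd + 1, batch.length());
            }
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        System.out.print(mergeCsvBatches(List.of("a,b\n1,2\n", "a,b\n3,4\n")));
    }
}
```

And of course this merge step buffers the whole result in memory again, which is why it feels like the wrong fix.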
Is there some sort of "GetMongoRecord" processor that reads each Mongo Document
as a record, the way ExecuteSQL does? (I've run the same test against an SQL
database, and it handles 1M records just fine.)
Thanks for your help,
Kelsey
Following changes to working-time regulations, if you receive this email before
7:00 a.m., in the evening, during the weekend, or during your time off, please
do not process it or reply immediately, except in cases of exceptional urgency.