We should just make it so flow file references can be swapped out, and the JSON reading is loading into memory too. The memory killer is the JSON being fully read in and then the 400,000 flow file objects.
Not reading the original JSON fully into memory is a processor/library fix, and fixing the handling of in-session flow file tracking of large batches is a framework thing. In the meantime it can of course be worked around with execute command or script processors.

Thanks
Joe

On Nov 29, 2016 8:28 PM, "Andy LoPresto" <[email protected]> wrote:

> It will probably be a little tricky to tune if you don't know the sizes ahead of time, but let's brainstorm. Assume you get original inputs from 100 B to 100 MB, where 100 B is 1-2 individual records and 100 MB is 1.1 million records (given the same ratio as your earlier example). The simplest way would be to just use a SplitText processor to split input into chunks of (for example) 1000 lines. If the incoming flowfile was contained completely in that size threshold, nothing would happen and your flow would continue normally. If the flowfile was larger than 1000 lines, it would be split, and there is a low likelihood that it would be split exactly on a JSON block boundary. While there isn't a ValidateJSON processor, you can use the JoltTransformJSON processor with the "sort" operation to essentially validate the results of the split: if it succeeds, it's valid JSON; if not, try recombining the preceding/following flowfile contents.
>
> At some point, as all tricky issues do for me, it comes to ExecuteScript with custom logic. Using Groovy and streams, you could scan the file for formed JSON blocks (i.e. regex or binary searching on lines to try and read blocks from manageable memory chunks) and store these in a memory buffer until you reach a configured threshold, then kick out a new flowfile containing n blocks which fit in x MB. SplitJSON would then operate on these flowfiles and split the individual flowfile containing n blocks into n flowfiles.
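A minimal sketch of the streaming chunker described above, in Python rather than the Groovy an actual ExecuteScript body would use (the function name, brace-depth scan, and threshold are illustrative assumptions; a real script would read from and write to the NiFi session rather than plain streams):

```python
import io
import json

def chunk_json_blocks(stream, max_chunk_bytes=1_000_000):
    """Scan a line stream for complete JSON blocks and yield chunks of
    blocks, each chunk staying under max_chunk_bytes. Brace-depth counting
    stands in for the regex/binary search described above; for brevity it
    assumes braces inside strings are balanced or absent."""
    buffer, buffered = [], 0
    block, depth = [], 0
    for line in stream:
        block.append(line)
        depth += line.count("{") - line.count("}")
        if depth == 0:
            text = "".join(block)
            block = []
            if not text.strip():
                continue
            json.loads(text)  # validate the assembled block before buffering
            if buffer and buffered + len(text) > max_chunk_bytes:
                yield "".join(buffer)  # kick out one flowfile-sized chunk
                buffer, buffered = [], 0
            buffer.append(text)
            buffered += len(text)
    if buffer:
        yield "".join(buffer)

# One 100-record newline-delimited input becomes several sub-threshold chunks.
data = "\n".join(json.dumps({"id": i, "v": "x" * 10}) for i in range(100)) + "\n"
chunks = list(chunk_json_blocks(io.StringIO(data), max_chunk_bytes=200))
```

Because only one chunk's worth of blocks is ever buffered, memory use is bounded by the threshold rather than the file size.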
> Because of the streaming nature of the ExecuteScript code, you should not encounter OOM exceptions, but your throughput would obviously be much slower on large files, and this would block because it operates serially.
>
> The best solution might be a combination of the two approaches: you could set up an ExecuteStreamCommand processor to run "wc -l <flowfile content>" to return the count of the number of lines in the file, and then route it to the "small/in-memory/fast" isolated SplitJSON vs. the "large/streaming/slow" ExecuteScript processor given the number of lines in the file. Even with a 36 MB file, you could easily load the entire thing in memory and split it into 30-50 smaller flowfiles with a 1 GB heap as you configured below. Then the small flowfiles would be operated on serially, so you would not encounter the memory issues.
>
> Hope this helps or at least directs you towards a better solution, given that you are more familiar with your specific problem space and the types of incoming data.
>
> Andy LoPresto
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On Nov 29, 2016, at 5:50 PM, Olav Jordens <[email protected]> wrote:
>
> Thanks Andy; this is a good suggestion, but in my case, the workflow must deal with 'small' and large JSON files to split and I don't know in advance which ones will cause this problem. I will give it some thought though because it does sound like it is a workable way around the problem.
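The line-count routing step can be sketched as follows (a Python stand-in for the ExecuteStreamCommand plus routing combination; the threshold and function names are hypothetical and would need tuning to your heap):

```python
import os
import subprocess
import tempfile

SMALL_MAX_LINES = 50_000  # hypothetical cutover point; tune to your heap size

def count_lines(path):
    # The ExecuteStreamCommand step: "wc -l" streams the file and never
    # holds more than a small buffer of it in memory.
    out = subprocess.run(["wc", "-l", path], capture_output=True,
                         text=True, check=True)
    return int(out.stdout.split()[0])

def route(path):
    # Routing equivalent: small files go to the in-memory SplitJSON path,
    # large ones to the streaming ExecuteScript path.
    if count_lines(path) <= SMALL_MAX_LINES:
        return "small/in-memory/fast"
    return "large/streaming/slow"
```

In NiFi itself the count would land in a flowfile attribute and a RouteOnAttribute-style decision would pick the branch; the sketch only shows the decision logic.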
>
> Olav Jordens
> Senior ETL Developer
> Two Degrees Mobile Limited
> (M) 022 620 2429
> (P) 09 919 7000
> www.2degreesmobile.co.nz
> Two Degrees Mobile Limited | 47-49 George Street | Newmarket | Auckland | New Zealand |
> PO Box 8355 | Symonds Street | Auckland 1150 | New Zealand | Fax +64 9 919 7001
>
> Disclaimer
> The e-mail and any files transmitted with it are confidential and may contain privileged or copyright information. If you are not the intended recipient you must not copy, distribute, or use this e-mail or the information contained in it for any purpose other than to notify us of the error. If you have received this message in error, please notify the sender immediately, by email or phone (+64 9 919 7000) and delete this email from your system. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Two Degrees Mobile Limited. We do not guarantee that this material is free from viruses or any other defects although due care has been taken to minimize the risk.
>
> From: Andy LoPresto [mailto:[email protected]]
> Sent: Wednesday, 30 November 2016 2:48 p.m.
> To: [email protected]
> Subject: Re: Hanging on SplitJSON
>
> Olav,
>
> Have you tried "stacking" these processors so the initial split breaks the complete input into smaller chunks and then each of those are split again? This is a common pattern we recommend with splitting or merging from/to large files.
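The stacked-split pattern just described reduces to simple two-stage chunking; a minimal sketch (the function name and chunk sizes are illustrative, not NiFi APIs):

```python
def split_stage(records, chunk_size):
    """One split stage: break a batch into chunks of at most chunk_size
    records. Stacking two stages keeps only about chunk_size split results
    live at once instead of all 400,000."""
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

# Stage 1: 4000 records become 4 flowfile-like chunks of 1000 each.
records = [{"id": i} for i in range(4000)]
chunks = list(split_stage(records, 1000))

# Stage 2: each chunk becomes individual records, one chunk at a time.
singles = [one for chunk in chunks for one in split_stage(chunk, 1)]
```

The point of the stacking is the intermediate stage: peak live objects track the chunk size, not the total record count.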
> I don't know what the overall structure of your original file is, but you should be able to use the SplitContent processor to split on boundaries (for example, if you know each distinct JSON block starts with the same key; I know order is not enforced, but you may have this scenario because all of the blocks are in the same file), and take each flowfile containing 100-1000 JSON objects and then route them to the SplitJSON processor.
>
> Andy LoPresto
> [email protected]
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69
>
> On Nov 29, 2016, at 5:34 PM, Olav Jordens <[email protected]> wrote:
>
> Joe,
>
> Thanks so much; certainly if it tries to batch this job, I will not have enough RAM on my small system, but if the processor would understand that and push out batches of splits at a time, then it would work for me. I'll log the JIRA.
> Cheers,
> Olav
>
> From: Joe Witt [mailto:[email protected]]
> Sent: Wednesday, 30 November 2016 2:22 p.m.
> To: [email protected]
> Subject: Re: Hanging on SplitJSON
>
> Olav
>
> We want you to be able to split your 36MB file into 400,000 things and not have to stress about this. Do you mind please filing a JIRA for this to be followed up on? We can definitely do better.
>
> Thanks
> Joe
>
> On Tue, Nov 29, 2016 at 8:09 PM, Olav Jordens <[email protected]> wrote:
>
> Hi,
>
> My bad; the problem appears to be that the 36MB JSON file would be split into 400,000 individual records, each carrying a substantial load of attributes. This must be causing an out-of-memory condition, although I could not find such an error in the logs; perhaps even the logs were no longer being written to properly!
>
> Thanks,
> Olav
>
> From: Olav Jordens [mailto:[email protected]]
> Sent: Wednesday, 30 November 2016 1:25 p.m.
> *To:* [email protected] > *Subject:* Hanging on SplitJSON > > Hi, > > I have a JSON file of about 36MB which is passed to a SplitJSON processor. > This processor runs for a while and then my UI hangs. In the app-log the > following ERRORs pop up: > > 2016-11-30 13:03:30,999 ERROR [Site-to-Site Worker Thread-393] > o.a.nifi.remote.SocketRemoteSiteListener Unable to communicate with > remote instance Peer[url=nifi://localhost:42758] due to > java.net.SocketTimeoutException: > Timed out reading from socket; closing connection > > However, I suspect that this has nothing to do with Site-to-Site (from my > single nifi instance to itself) as there are no ERRORs prior to my flowfile > hitting the SplitJSON processor, and every time I re-run, it is at this > point that it hangs. My java Xmx=1024m and Xms=1024m. When I do a nifi dump: > > bin/nifi.sh dump > nifi.sh: JAVA_HOME not set; results may vary > > Java home: > NiFi home: /app/HDF-2.0.1.0/nifi > > Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf > > Exception in thread "main" java.net.SocketTimeoutException: Read timed out > at java.net.SocketInputStream.socketRead0(Native Method) > at java.net.SocketInputStream.socketRead(SocketInputStream. 
> java:116) > at java.net.SocketInputStream.read(SocketInputStream.java:170) > at java.net.SocketInputStream.read(SocketInputStream.java:141) > at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at org.apache.nifi.bootstrap.RunNiFi.dump(RunNiFi.java:695) > at org.apache.nifi.bootstrap.RunNiFi.main(RunNiFi.java:225) > > This again points at a socket issue, but my main confusion is why this > error occurs every time the flowfile hits the SplitJSON processor? > > The status indicates that it is hanging and not responding to ping > requests: > > service nifi status > nifi.sh: JAVA_HOME not set; results may vary > > Java home: > NiFi home: /app/HDF-2.0.1.0/nifi > > Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf > > 2016-11-30 13:23:31,786 INFO [main] org.apache.nifi.bootstrap.Command > Apache NiFi is running at PID 23080 but is not responding to ping requests > > Any ideas? > > Thanks, > Olav > > > > *Olav Jordens* > > > > > > *Senior ETL DeveloperTwo Degrees Mobile > Limited===========================(M) 022 620 2429(P) 09 919 > 7000www.2degreesmobile.co.nz <http://www.2degreesmobile.co.nz/>* > <image001.jpg> > Two Degrees Mobile Limited | 47-49 George Street | Newmarket | Auckland | > New Zealand | > PO Box 8355 | Symonds Street | Auckland 1150 | New Zealand | Fax +64 9 > 919 7001 <+64%209-919%207001> > > > <image002.png> <image003.png> <image004.png> <image005.png> > ------------------------------ > Disclaimer > The e-mail and any files transmitted with it are confidential and may > contain privileged or copyright information. 
If you are not the intended > recipient you must not copy, distribute, or use this e-mail or the > information contained in it for any purpose other than to notify us of the > error. If you have received this message in error, please notify the sender > immediately, by email or phone (+64 9 919 7000 <+64%209-919%207000>) and > delete this email from your system. Any views expressed in this message are > those of the individual sender, except where the sender specifically states > them to be the views of Two Degrees Mobile Limited. We do not guarantee > that this material is free from viruses or any other defects although due > care has been taken to minimize the risk > > >
