Thanks Andy, lots of food for thought here. I think you are right that with an ExecuteStreamCommand processor I would be able to understand enough about my flowfile to route it correctly. One thought, though, is that this does feel a bit like hacking around the framework, which should ideally know not to do something that could result in an OOM error. Nevertheless, it does feel like I may be able to get this idea implemented quite quickly. NiFi is awesome that way.
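
As a rough illustration of the routing idea only (not NiFi code), the sketch below counts newlines in a stream without loading it into memory, much as `wc -l` would, and then picks a path. The names `count_lines` and `choose_route` and the 10,000-line threshold are hypothetical; the real cut-off would need tuning against the available heap.

```python
import io
from typing import BinaryIO


def count_lines(stream: BinaryIO, chunk_size: int = 1 << 16) -> int:
    """Count newline bytes in the stream, reading one fixed-size chunk at a time."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return total
        total += chunk.count(b"\n")


def choose_route(line_count: int, threshold: int = 10_000) -> str:
    """Return "small" for the in-memory path, "large" for the streaming path.

    The threshold is an assumed placeholder, not a recommended value.
    """
    return "small" if line_count <= threshold else "large"


if __name__ == "__main__":
    sample = io.BytesIO(b'{"a": 1}\n{"b": 2}\n{"c": 3}\n')
    print(choose_route(count_lines(sample)))
```

Only the line count is held in memory, so the check itself should stay cheap even for the largest inputs.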
Olav Jordens
Senior ETL Developer
Two Degrees Mobile Limited
===========================
(M) 022 620 2429
(P) 09 919 7000
www.2degreesmobile.co.nz

Two Degrees Mobile Limited | 47-49 George Street | Newmarket | Auckland | New Zealand | PO Box 8355 | Symonds Street | Auckland 1150 | New Zealand | Fax +64 9 919 7001

________________________________
Disclaimer
The e-mail and any files transmitted with it are confidential and may contain privileged or copyright information. If you are not the intended recipient you must not copy, distribute, or use this e-mail or the information contained in it for any purpose other than to notify us of the error. If you have received this message in error, please notify the sender immediately, by email or phone (+64 9 919 7000) and delete this email from your system. Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Two Degrees Mobile Limited. We do not guarantee that this material is free from viruses or any other defects although due care has been taken to minimize the risk

From: Andy LoPresto [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 3:30 p.m.
To: [email protected]
Subject: Re: Hanging on SplitJSON

It will probably be a little tricky to tune if you don’t know the sizes ahead of time, but let’s brainstorm. Assume you get original inputs from 100 B to 100 MB, where 100 B is 1-2 individual records and 100 MB is 1.1 million records (given the same ratio as your earlier example).

The simplest way would be to just use a SplitText processor to split input into chunks of (for example) 1000 lines. If the incoming flowfile was contained completely in that size threshold, nothing would happen and your flow would continue normally.
If the flowfile was larger than 1000 lines, it would be split, and there is a low likelihood that it would be split exactly on a JSON block boundary. While there isn’t a ValidateJSON processor, you can use the JoltTransformJSON processor with the “sort” operation to essentially validate the results of the split: if it succeeds, it’s valid JSON; if not, try recombining the preceding/following flowfile contents.

At some point, as all tricky issues do for me, it comes to ExecuteScript with custom logic. Using Groovy and streams, you could scan the file for well-formed JSON blocks (i.e. regex or binary searching on lines to try and read blocks from manageable memory chunks) and store these in a memory buffer until you reach a configured threshold, then kick out a new flowfile containing n blocks which fit in x MB. SplitJSON would then operate on these flowfiles and split each flowfile containing n blocks into n flowfiles. Because of the streaming nature of the ExecuteScript code, you should not encounter OOM exceptions, but your throughput would obviously be much slower on large files and this would block because it operates serially.

The best solution might be a combination of the two approaches. You could set up an ExecuteStreamCommand processor to run “wc -l <flowfile content>” to return the number of lines in the file, and then route it to the “small/in-memory/fast” isolated SplitJSON vs. the “large/streaming/slow” ExecuteScript processor given the number of lines in the file. Even with a 36 MB file, you could easily load the entire thing in memory and split it into 30-50 smaller flowfiles with a 1 GB heap as you configured below. Then the small flowfiles would be operated on serially, so you would not encounter the memory issues.

Hope this helps or at least directs you towards a better solution, given that you are more familiar with your specific problem space and the types of incoming data.
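
To make the streaming idea above a little more concrete, here is a minimal sketch in Python rather than Groovy, and outside of NiFi entirely. It assumes the input is a sequence of top-level JSON objects (one after another, or inside a single enclosing array), which may not match the real file. The names `iter_json_blocks` and `batch_blocks` and the default thresholds are hypothetical.

```python
import io
import json
from typing import Iterator, List, TextIO


def iter_json_blocks(stream: TextIO, chunk_size: int = 8192) -> Iterator[str]:
    """Yield each complete top-level {...} block, reading one chunk at a time."""
    depth = 0          # current brace nesting level
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    buf: List[str] = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if depth == 0 and ch != "{":
                continue  # skip whitespace, commas, brackets between blocks
            buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:  # a whole block has been read
                    yield "".join(buf)
                    buf.clear()


def batch_blocks(stream: TextIO, max_blocks: int = 1000,
                 max_bytes: int = 1_000_000) -> Iterator[List[str]]:
    """Group blocks into batches capped by count (n) and UTF-8 size (x)."""
    batch: List[str] = []
    size = 0
    for block in iter_json_blocks(stream):
        block_bytes = len(block.encode("utf-8"))
        if batch and (len(batch) >= max_blocks or size + block_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(block)
        size += block_bytes
    if batch:
        yield batch


if __name__ == "__main__":
    sample = '{"a": 1}\n{"b": {"c": "}"}}\n{"d": [1, 2]}\n'
    for group in batch_blocks(io.StringIO(sample), max_blocks=2):
        print([json.loads(block) for block in group])
```

Only one batch is held in memory at a time, which is the property that should keep heap usage bounded; each emitted batch would correspond to one intermediate flowfile handed on to SplitJSON.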

Andy LoPresto
[email protected]
[email protected]
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4 BACE 3C6E F65B 2F7D EF69

On Nov 29, 2016, at 5:50 PM, Olav Jordens <[email protected]> wrote:

Thanks Andy, this is a good suggestion, but in my case, the workflow must deal with ‘small’ and large JSON files to split and I don’t know in advance which ones will cause this problem. I will give it some thought though because it does sound like it is a workable way around the problem.

Olav Jordens

From: Andy LoPresto [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 2:48 p.m.
To: [email protected]
Subject: Re: Hanging on SplitJSON

Olav,

Have you tried “stacking” these processors so the initial split breaks the complete input into smaller chunks and then each of those is split again? This is a common pattern we recommend with splitting or merging from/to large files.

I don’t know what the overall structure of your original file is, but you should be able to use the SplitContent processor to split on boundaries, for example if you know each distinct JSON block starts with the same key (I know order is not enforced, but you may have this scenario because all of the blocks are in the same file). You can then take each flowfile containing 100-1000 JSON objects and route it to the SplitJSON processor.

Andy LoPresto

On Nov 29, 2016, at 5:34 PM, Olav Jordens <[email protected]> wrote:

Joe,

Thanks so much. Certainly if it tries to batch this job, I will not have enough RAM on my small system, but if the processor would understand that and push out batches of splits at a time, then it would work for me. I’ll log the JIRA.

Cheers,
Olav

From: Joe Witt [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 2:22 p.m.
To: [email protected]
Subject: Re: Hanging on SplitJSON

Olav

We want you to be able to split your 36MB file into 400,000 things and not have to stress about this. Do you mind please filing a JIRA for this to be followed up on? We can definitely do better.

Thanks
Joe

On Tue, Nov 29, 2016 at 8:09 PM, Olav Jordens <[email protected]> wrote:

Hi,

My bad, the problem appears to be that the 36MB JSON file would be split into > 400 000 individual records, each carrying a substantial load of attributes.
This must be causing an out-of-memory error, although I could not find such an error in the logs; perhaps even the logs were no longer being written to properly!

Thanks,
Olav

From: Olav Jordens [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 1:25 p.m.
To: [email protected]
Subject: Hanging on SplitJSON

Hi,

I have a JSON file of about 36MB which is passed to a SplitJSON processor. This processor runs for a while and then my UI hangs. In the app-log the following ERRORs pop up:

2016-11-30 13:03:30,999 ERROR [Site-to-Site Worker Thread-393] o.a.nifi.remote.SocketRemoteSiteListener Unable to communicate with remote instance Peer[url=nifi://localhost:42758] due to java.net.SocketTimeoutException: Timed out reading from socket; closing connection

However, I suspect that this has nothing to do with Site-to-Site (from my single NiFi instance to itself), as there are no ERRORs prior to my flowfile hitting the SplitJSON processor, and every time I re-run, it is at this point that it hangs. My java Xmx=1024m and Xms=1024m.
When I do a nifi dump:

bin/nifi.sh dump
nifi.sh: JAVA_HOME not set; results may vary

Java home:
NiFi home: /app/HDF-2.0.1.0/nifi

Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf

Exception in thread "main" java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at org.apache.nifi.bootstrap.RunNiFi.dump(RunNiFi.java:695)
        at org.apache.nifi.bootstrap.RunNiFi.main(RunNiFi.java:225)

This again points at a socket issue, but my main confusion is why this error occurs every time the flowfile hits the SplitJSON processor. The status indicates that it is hanging and not responding to ping requests:

service nifi status
nifi.sh: JAVA_HOME not set; results may vary

Java home:
NiFi home: /app/HDF-2.0.1.0/nifi

Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf

2016-11-30 13:23:31,786 INFO [main] org.apache.nifi.bootstrap.Command Apache NiFi is running at PID 23080 but is not responding to ping requests

Any ideas?

Thanks,
Olav
