Thanks Andy – lots of food for thought here. I think you are right that with an 
ExecuteStreamCommand processor I would be able to understand enough about my 
flowfile to route it correctly. One thought though is that this does feel a bit 
like hacking around the framework which should ideally know not to do something 
that would/could result in an OOM error. Nevertheless it does feel like I may 
be able to get this idea implemented quite quickly. NiFi is awesome that way.




Olav Jordens
Senior ETL Developer
Two Degrees Mobile Limited
===========================
(M) 022 620 2429
(P) 09 919 7000
www.2degreesmobile.co.nz
Two Degrees Mobile Limited | 47-49 George Street | Newmarket | Auckland | New 
Zealand |
PO Box 8355 | Symonds Street | Auckland 1150 | New Zealand | Fax +64 9 919 7001



________________________________

Disclaimer
The e-mail and any files transmitted with it are confidential and may contain 
privileged or copyright information. If you are not the intended recipient you 
must not copy, distribute, or use this e-mail or the information contained in 
it for any purpose other than to notify us of the error. If you have received 
this message in error, please notify the sender immediately, by email or phone 
(+64 9 919 7000) and delete this email from your system. Any views expressed in 
this message are those of the individual sender, except where the sender 
specifically states them to be the views of Two Degrees Mobile Limited. We do 
not guarantee that this material is free from viruses or any other defects 
although due care has been taken to minimize the risk


From: Andy LoPresto [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 3:30 p.m.
To: [email protected]
Subject: Re: Hanging on SplitJSON

It will probably be a little tricky to tune if you don’t know the sizes ahead 
of time, but let’s brainstorm. Assume you get original inputs from 100 B to 100 
MB, where 100 B is 1-2 individual records and 100 MB is 1.1 million records 
(given the same ratio as your earlier example). The simplest way would be to 
just use a SplitText processor to split input into chunks of (for example) 1000 
lines. If the incoming flowfile was contained completely in that size 
threshold, nothing would happen and your flow would continue normally. If the 
flowfile was larger than 1000 lines, it would be split, and there is a low 
likelihood that it would be split exactly on a JSON block boundary. While there 
isn’t a ValidateJSON processor, you can use the JoltTransformJSON processor 
with the “sort” operation to essentially validate the results of the split — if 
it succeeds, it’s valid JSON; if not, try recombining the preceding/following 
flowfile contents.
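
As a rough illustration of the split-then-validate idea above, here is a minimal Python sketch (plain Python, not NiFi code). It assumes the input is a series of JSON objects written one after another rather than a single top-level array, and the names split_lines, holds_whole_blocks and recombine_until_valid are hypothetical helpers invented for this example. A chunk counts as valid when it contains only complete JSON objects; otherwise it is glued to the following chunk and checked again.

```python
import json
from itertools import islice


def split_lines(lines, chunk_size=1000):
    """Yield chunks of at most chunk_size lines, similar to a line-based split."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield "".join(chunk)


def holds_whole_blocks(text):
    """Return True if text is made up only of complete JSON objects.

    Assumption: every block is a JSON object (dict), not a bare scalar or array.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            return True
        try:
            value, pos = decoder.raw_decode(text, pos)
        except ValueError:
            return False
        if not isinstance(value, dict):
            return False


def recombine_until_valid(chunks):
    """Glue adjacent chunks together until the accumulated text holds whole blocks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        if holds_whole_blocks(pending):
            yield pending
            pending = ""
    if pending.strip():
        raise ValueError("trailing content never formed complete JSON blocks")
```

Note that in the worst case (no chunk boundary ever lines up with a block boundary) the recombination ends up rebuilding the whole input, which is why this is only a first approximation.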

At some point, as all tricky issues do for me, it comes to ExecuteScript with 
custom logic. Using Groovy and streams, you could scan the file for well-formed 
JSON blocks (e.g. regex or binary searching on lines to try to read blocks from 
manageable memory chunks) and store these in a memory buffer until you reach a 
configured threshold, then kick out a new flowfile containing n blocks which 
fit in x MB. SplitJSON would then operate on these flowfiles and split each 
flowfile containing n blocks into n flowfiles. Because of the streaming nature 
of the ExecuteScript code, you should not encounter OOM exceptions, but your 
throughput would obviously be much slower on large files, and this would block 
because it operates serially.
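
To make the streaming idea concrete, here is a minimal sketch in Python rather than Groovy (a swapped-in language, and not an ExecuteScript body). It assumes the stream holds top-level JSON objects one after another; iter_json_blocks, batch_blocks, read_size and max_bytes are hypothetical names chosen for illustration. The first generator reads a small piece at a time and emits each complete object; the second groups objects until the next one would push the batch past the size threshold.

```python
import io
import json


def iter_json_blocks(stream, read_size=64 * 1024):
    """Yield the raw text of each top-level JSON object in a text stream.

    Only the current, not-yet-complete block is held in memory.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        data = stream.read(read_size)
        at_eof = not data
        buffer += data
        while True:
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                _, end = decoder.raw_decode(buffer)
            except ValueError:
                break  # block is incomplete; read more input first
            yield buffer[:end]
            buffer = buffer[end:]
        if at_eof:
            if buffer:
                raise ValueError("stream ended inside a JSON block")
            return


def batch_blocks(blocks, max_bytes):
    """Group blocks into lists whose encoded size stays within max_bytes.

    A single block larger than max_bytes is emitted as a batch of its own.
    """
    batch, size = [], 0
    for block in blocks:
        block_size = len(block.encode("utf-8"))
        if batch and size + block_size > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(block)
        size += block_size
    if batch:
        yield batch
```

Each yielded batch corresponds to one outgoing flowfile of n blocks in the description above. Re-parsing the buffer after every read is simple but wasteful for very large individual objects.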

The best solution might be a combination of the two approaches — you could set 
up an ExecuteStreamCommand processor to run “wc -l <flowfile content>” to 
return the count of the number of lines in the file, and then route it to the 
“small/in-memory/fast” isolated SplitJSON vs. the “large/streaming/slow” 
ExecuteScript processor given the number of lines in the file. Even with a 36 
MB file, you could easily load the entire thing in memory and split it into 
30-50 smaller flowfiles with a 1 GB heap as you configured below. Then the 
small flowfiles would be operated on serially, so you would not encounter the 
memory issues.
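
The routing decision described above could look roughly like the following Python sketch. It counts newline bytes in fixed-size reads, which is what "wc -l" reports, and then picks a path. The function names and the 10,000-line threshold are assumptions for illustration only; the right cut-off depends on your heap size and record size.

```python
import io


def count_lines(stream, read_size=1024 * 1024):
    """Count newline bytes in a binary stream without loading it all at once."""
    count = 0
    while True:
        data = stream.read(read_size)
        if not data:
            return count
        count += data.count(b"\n")


def choose_route(line_count, threshold=10_000):
    """Return 'small' for the in-memory path, 'large' for the streaming path.

    The threshold is a hypothetical value, not a recommendation.
    """
    return "small" if line_count <= threshold else "large"
```

For example, choose_route(count_lines(open(path, "rb"))) would give the path for a file on disk, where path is whatever file you want to test.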

Hope this helps or at least directs you towards a better solution, given that 
you are more familiar with your specific problem space and the types of 
incoming data.

Andy LoPresto
[email protected]<mailto:[email protected]>
[email protected]<mailto:[email protected]>
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

On Nov 29, 2016, at 5:50 PM, Olav Jordens 
<[email protected]<mailto:[email protected]>> 
wrote:


Thanks Andy – This is a good suggestion, but in my case, the workflow must deal 
with ‘small’ and large JSON files to split and I don’t know in advance which 
ones will cause this problem. I will give it some thought though because it 
does sound like it is a workable way around the problem.





From: Andy LoPresto [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 2:48 p.m.
To: [email protected]<mailto:[email protected]>
Subject: Re: Hanging on SplitJSON

Olav,

Have you tried “stacking” these processors so the initial split breaks the 
complete input into smaller chunks and then each of those is split again? This 
is a common pattern we recommend when splitting or merging from/to large files. 
I don’t know what the overall structure of your original file is, but you 
should be able to use the SplitContent processor to split on boundaries, for 
example if you know each distinct JSON block starts with the same key. (I know 
key order is not enforced, but you may have this scenario because all of the 
blocks are in the same file.) You can then take each flowfile containing 
100-1000 JSON objects and route it to the SplitJSON processor.
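
A small Python sketch of the boundary-based first stage, under the assumption that every block really does start with the same text (here a made-up '{"id":' prefix) and that this text never appears inside a string value. split_on_boundary and group_blocks are hypothetical helpers, not NiFi APIs. Each group would then be handed to the second, finer split.

```python
def split_on_boundary(text, boundary):
    """Cut text before every occurrence of boundary.

    The boundary stays with the block that follows it. Anything before the
    first boundary is dropped.
    """
    starts = []
    pos = text.find(boundary)
    while pos != -1:
        starts.append(pos)
        pos = text.find(boundary, pos + len(boundary))
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def group_blocks(blocks, blocks_per_chunk):
    """Group blocks into lists of at most blocks_per_chunk (the coarse split)."""
    return [
        blocks[i:i + blocks_per_chunk]
        for i in range(0, len(blocks), blocks_per_chunk)
    ]
```

With blocks_per_chunk somewhere in the 100-1000 range mentioned above, the second stage only ever has to fan out one modest chunk at a time.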

Andy LoPresto

On Nov 29, 2016, at 5:34 PM, Olav Jordens 
<[email protected]<mailto:[email protected]>> 
wrote:

Joe,

Thanks so much – certainly if it tries to batch this job, I will not have 
enough RAM on my small system, but if the processor would understand that and 
push out batches of splits at a time, then it would work for me. I’ll log the 
JIRA.
Cheers,
Olav


From: Joe Witt [mailto:[email protected]]
Sent: Wednesday, 30 November 2016 2:22 p.m.
To: [email protected]<mailto:[email protected]>
Subject: Re: Hanging on SplitJSON

Olav

We want you to be able to split your 36MB file into 400,000 things and not have 
to stress about this.  Do you mind please filing a JIRA for this to be followed 
up on?  We can definitely do better.

Thanks
Joe

On Tue, Nov 29, 2016 at 8:09 PM, Olav Jordens 
<[email protected]<mailto:[email protected]>> 
wrote:
Hi,

My bad – the problem appears to be that the 36MB JSON file would be split into 
more than 400 000 individual records, each carrying a substantial load of 
attributes. This must be causing an out-of-memory error, although I could not 
find such an error in the logs – perhaps even the logs were no longer being 
written to properly!

Thanks,
Olav


From: Olav Jordens 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Wednesday, 30 November 2016 1:25 p.m.
To: [email protected]<mailto:[email protected]>
Subject: Hanging on SplitJSON

Hi,

I have a JSON file of about 36MB which is passed to a SplitJSON processor. This 
processor runs for a while and then my UI hangs. In the app-log the following 
ERRORs pop up:

2016-11-30 13:03:30,999 ERROR [Site-to-Site Worker Thread-393] 
o.a.nifi.remote.SocketRemoteSiteListener Unable to communicate with remote 
instance Peer[url=nifi://localhost:42758] due to 
java.net.SocketTimeoutException: Timed out reading from 
socket; closing connection

However, I suspect that this has nothing to do with Site-to-Site (from my 
single nifi instance to itself) as there are no ERRORs prior to my flowfile 
hitting the SplitJSON processor, and every time I re-run, it is at this point 
that it hangs. My Java heap settings are Xmx=1024m and Xms=1024m. When I do a 
NiFi dump:

bin/nifi.sh dump
nifi.sh: JAVA_HOME not set; results may vary

Java home:
NiFi home: /app/HDF-2.0.1.0/nifi

Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf

Exception in thread "main" java.net.SocketTimeoutException: 
Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at org.apache.nifi.bootstrap.RunNiFi.dump(RunNiFi.java:695)
        at org.apache.nifi.bootstrap.RunNiFi.main(RunNiFi.java:225)

This again points at a socket issue, but my main confusion is why this error 
occurs every time the flowfile hits the SplitJSON processor?

The status indicates that it is hanging and not responding to ping requests:

service nifi status
nifi.sh: JAVA_HOME not set; results may vary

Java home:
NiFi home: /app/HDF-2.0.1.0/nifi

Bootstrap Config File: /app/HDF-2.0.1.0/nifi/conf/bootstrap.conf

2016-11-30 13:23:31,786 INFO [main] org.apache.nifi.bootstrap.Command Apache 
NiFi is running at PID 23080 but is not responding to ping requests

Any ideas?

Thanks,
Olav



