Re: SplitText processor OOM larger input files

2017-06-02 Thread Mark Payne
Martin, I agree with Andrew on the point of a single virtual core not being much. I've not really dealt with Google Compute Cloud personally, but on the AWS t1.micro instances, which offer a similar VM, I wouldn't expect much out of them. That being said, let's look a bit deeper at some of the

Re: SplitText processor OOM larger input files

2017-06-02 Thread Andrew Grande
1 vcore, which is not even a full core (a shared and oversubscribed CPU core). I'm not sure what you expected to see when you raised concurrency to 10 :) There are a lot of things NiFi is doing behind the scenes, especially around provenance recording. I don't recommend anything below 4 cores to

Re: SplitText processor OOM larger input files

2017-06-02 Thread Martin Eden
Thanks Andrew, I have added UpdateAttribute processors to update the file names like you said. Now it works, writing out 1MB files at a time (updated the MergeContent MaxNumberOfEntries to 1 to achieve that, since each line in my csv is 100 bytes). The current flow is: ListHDFS -> FetchHDFS
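The MaxNumberOfEntries value above looks truncated in the archive preview, so as a sanity check (my arithmetic, not from the thread): with the 100-byte lines Martin mentions, a 1 MB merged file corresponds to a fixed number of entries per bin:

```python
# Sanity check for MergeContent sizing.
# Assumption (from the message): each CSV line is ~100 bytes.
target_file_size = 1_000_000  # ~1 MB output files
line_size = 100               # bytes per CSV line

entries_per_file = target_file_size // line_size
print(entries_per_file)  # -> 10000 lines per merged 1 MB file
```

So a Maximum Number of Entries on the order of 10,000 is what the stated 1 MB / 100-byte figures imply; the exact value Martin used is cut off in the preview.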

Re: SplitText processor OOM larger input files

2017-06-01 Thread Andrew Grande
It looks like your max bin size is 1000 and 10MB. Every time you hit those, it will write out a merged file. Update the filename attribute to be unique before writing via PutHDFS. Andrew On Thu, Jun 1, 2017, 2:24 AM Martin Eden wrote: > Hi Joe, > > Thanks for the
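A common way to make the filename unique (a sketch, not taken from the thread) is an UpdateAttribute processor that rewrites the standard `filename` core attribute with NiFi Expression Language, appending a UUID so successive merged files don't collide in PutHDFS:

```
# Hypothetical UpdateAttribute property; "filename" is the standard NiFi
# core attribute and UUID() is a NiFi Expression Language function.
filename = ${filename}-${UUID()}
```

Any scheme that guarantees uniqueness works (a timestamp, the fragment index attributes set by the split processors, etc.); the UUID form is just the simplest.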

Re: SplitText processor OOM larger input files

2017-06-01 Thread Martin Eden
Hi Joe, Thanks for the explanations. Really useful in understanding how it works. Good to know that in the future this will be improved. About the appending to HDFS issue let me recap. My flow is: ListHDFS -> FetchHDFS -> UnpackContent -> SplitText(5000) -> SplitText(1) -> RouteOnContent ->

Re: SplitText processor OOM larger input files

2017-05-31 Thread Joe Witt
Split failed before even with backpressure: yes, that backpressure kicks in when destination queues for a given processor have reached their target size (in count of FlowFiles or total size represented). However, to clarify why the OOM happened, it is important to realize that it is not about
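To make the heap pressure concrete (my numbers, not Joe's): splitting the 2.5 GB uncompressed file into one FlowFile per 100-byte line produces tens of millions of FlowFile objects, and FlowFile attributes live on the JVM heap regardless of where the content itself is stored. A back-of-the-envelope sketch, assuming a few hundred bytes of heap per FlowFile:

```python
# Rough heap estimate for a line-per-FlowFile split.
# Assumption (not from the thread): ~500 bytes of heap per FlowFile for the
# object plus its attribute map; real overhead varies with attribute count.
uncompressed_bytes = int(2.5 * 1024**3)  # 2.5 GB input, per the thread
line_bytes = 100                         # per the thread, each CSV line is ~100 B
heap_per_flowfile = 500                  # assumed heap cost per FlowFile

flowfiles = uncompressed_bytes // line_bytes
heap_needed = flowfiles * heap_per_flowfile
print(flowfiles)                 # ~26.8 million FlowFiles
print(heap_needed / 1024**3)     # ~12.5 GB of heap vs. the 1 GB configured
```

Even if the per-FlowFile overhead assumption is off by an order of magnitude, the total still dwarfs a 1 GB heap, which is why backpressure on downstream queues cannot save a single split operation that fans out this far.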

Re: SplitText processor OOM larger input files

2017-05-31 Thread Martin Eden
Hi Koji, Good to know that it can handle large files. I thought that was the case but I was just not seeing it in practice. Yes, I am using 'Line Split Count' as 1 at SplitText. I added the extra SplitText processor exactly as you suggested and the OOM went away. So, big thanks!!! However I have 2

Re: SplitText processor OOM larger input files

2017-05-31 Thread Koji Kawamura
Hi Martin, Generally, a NiFi processor doesn't load the entire content of a file and is capable of handling huge files. However, having a massive number of FlowFiles can cause an OOM issue, as FlowFiles and their Attributes reside on the heap. I assume you are using 'Line Split Count' as 1 at SplitText. We
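Koji's suggestion, which Martin reports applying above as SplitText(5000) -> SplitText(1), works by bounding the fan-out of any single split. A sketch of the peak FlowFile counts, under the assumption (consistent with the thread's OOM explanation) that a split processor holds all children of one parent in its session before committing:

```python
# Why a two-stage split keeps the live FlowFile count bounded.
# Assumption: SplitText emits every child of one parent FlowFile in a single
# session, so the peak count is roughly the largest single fan-out.
total_lines = 26_843_545  # ~2.5 GB of 100-byte lines (illustrative figure)
chunk_lines = 5000        # first-stage Line Split Count, per the thread

# Single-stage split straight to 1 line per file:
# every line becomes a live FlowFile at once.
single_stage_peak = total_lines

# Two-stage split: stage one produces chunk FlowFiles; stage two then
# splits only one 5000-line chunk at a time into single-line FlowFiles.
stage_one_chunks = total_lines // chunk_lines + 1
two_stage_peak = stage_one_chunks + chunk_lines

print(single_stage_peak)  # ~26.8 million live FlowFiles -> OOM on a 1 GB heap
print(two_stage_peak)     # ~10 thousand live FlowFiles -> fits comfortably
```

The chunk size 5000 is just the value used in the thread; any intermediate size that keeps both stage fan-outs small relative to the heap would do.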

SplitText processor OOM larger input files

2017-05-31 Thread Martin Eden
Hi all, I have a vanilla NiFi 1.2.0 node with 1GB of heap. The flow I am trying to run is: ListHDFS -> FetchHDFS -> SplitText -> RouteOnContent -> MergeContent -> PutHDFS When I give it a 300MB input zip file (2.5GB uncompressed) I am getting a Java OutOfMemoryError as below. Does NiFi read in