Re: Data Ingestion for Large Source Files and Masking

2016-01-22 Thread Joe Witt
Thanks Obaid!

Re: Data Ingestion for Large Source Files and Masking

2016-01-22 Thread obaidul karim
Hi Joe, I have created a JIRA NIFI-1432 as a new feature request (Efficient CSV processor) with some recommendations, and I am sharing my own code. -Obaid

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread obaidul karim
Joe, I am doing some optimizations on my CSV processing. Let me clean them up, then I will share the final version. -Obaid

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread Joe Witt
Quick observations for now off the latest data: - GC looks pretty good, though it is surprising there were any full GCs during that short test - CPU has low utilization - disk has low utilization. Can you share your sample input data, processor code, and flow as a template? Attaching to a JIRA for example c

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread obaidul karim
Joe, Last time it was below: java.arg.2=-Xms512m java.arg.3=-Xmx512m Now I made it as below: java.arg.2=-Xms5120m java.arg.3=-Xmx10240m The latest jstat & iostat output is attached. To me it is still slow, no significant improvement. -Obaid

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread Joe Witt
Obaid, Great, so this is helpful info. The iostat output shows both CPU and disk are generally bored and ready for more work. Looking at the GC output, though, suggests trouble. We see there are 32 samples spaced 1 second apart, and in that time it spent more than 6 seconds doing garbage collectio

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread obaidul karim
Hi Joe, Please find attached jstat & iostat output. So far it seems to me that it is CPU bound. However, your eyes are better than mine :). -Obaid

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread Joe Witt
Hello, Let's narrow in on potential issues. While this process is running and appears sluggish in nature, please run the following on the command line: 'jps'. This command will tell you the process id of NiFi. You'll want the pid associated with the Java process other than the one called 'jps'

Re: Data Ingestion for Large Source Files and Masking

2016-01-13 Thread obaidul karim
Hi Joe & Others, Thanks for all of your suggestions. Now I am using the below approach: 1. Buffered reader (I tried to use NLKBufferedReader, but it requires too many libs and NiFi failed to start. I was lost.) 2. Buffered writer 3. Appending the line ending instead of concatenating a newline. Still no performance
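The buffered line-by-line pattern described above can be sketched roughly as follows. This is a minimal illustration, not the processor code from the thread; the maskLine logic (blanking the second column) is purely hypothetical:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class MaskCsv {

    // Hypothetical masking rule: replace the second column with "***".
    static String maskLine(String line) {
        String[] cols = line.split(",", -1);
        if (cols.length > 1) cols[1] = "***";
        return String.join(",", cols);
    }

    // Stream input to output line by line, buffered on both sides,
    // writing the line separator via newLine() rather than
    // concatenating "\n" onto each string.
    static void mask(InputStream in, OutputStream out) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(maskLine(line));
                writer.newLine();
            }
        }
    }
}
```

Note that readLine() strips the original line ending, so a processor that must preserve exact CR/LF sequences (as NLKBufferedReader does) would need a different approach.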

Re: Data Ingestion for Large Source Files and Masking

2016-01-12 Thread Joe Witt
Hello, So the performance went from what sounded pretty good to what sounds pretty problematic. The rate now sounds like it is around 5MB/s, which is indeed quite poor. Building on what Bryan said, there do appear to be some good opportunities to improve the performance. The link he provided jus

Re: Data Ingestion for Large Source Files and Masking

2016-01-12 Thread Juan Sequeiros
Obaid, Since you mention that you will have dedicated ETL servers, and I assume they will also have a decent amount of RAM on them, I would not shy away from increasing your threads. Also, in your staging directory, if you do not need to keep the originals, you might consider GetFile, and on that one

Re: Data Ingestion for Large Source Files and Masking

2016-01-12 Thread Bryan Bende
Obaid, I can't say for sure how much this would improve performance, but you might want to wrap the OutputStream with BufferedOutputStream or BufferedWriter. Would be curious to hear if that helps. A similar scenario from the standard processors is ReplaceText; here is one example where it uses t
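The point of Bryan's suggestion is that buffering coalesces many small writes into a few large ones before they hit the underlying stream. A NiFi-independent sketch of the effect, using an illustrative CountingStream to count how often the raw sink is actually touched:

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedVsRaw {

    // Illustrative sink that counts how many write calls reach it.
    static class CountingStream extends OutputStream {
        int calls = 0;
        @Override public void write(int b) { calls++; }
        @Override public void write(byte[] b, int off, int len) { calls++; }
    }

    // Write `lines` short CSV rows either directly to the sink or
    // through an 8 KB BufferedOutputStream, and report how many
    // calls the sink received.
    static int writeLines(boolean buffered, int lines) throws IOException {
        CountingStream sink = new CountingStream();
        OutputStream out = buffered
            ? new BufferedOutputStream(sink, 8192)
            : sink;
        byte[] row = "col1,col2,col3\n".getBytes();
        for (int i = 0; i < lines; i++) {
            out.write(row);
        }
        out.flush();
        return sink.calls;
    }
}
```

With 1,000 rows the unbuffered path hits the sink 1,000 times, while the buffered path hits it only a handful of times; when the sink is a content-repository stream, that difference is the performance win.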

Re: Data Ingestion for Large Source Files and Masking

2016-01-12 Thread obaidul karim
Hi Joe, Yes, I took into consideration the existing RAID and HW settings. We have 10G NICs for all Hadoop intra-connectivity, and the server in question is an edge node of our Hadoop cluster. In the production scenario we will use dedicated ETL servers having high-performance (>500MB/s) local disks. Sharing a

Re: Data Ingestion for Large Source Files and Masking

2016-01-04 Thread Joe Witt
Obaid, Really happy you're seeing the performance you need. That works out to about 110MB/s on average over that period. Any chance you have a 1Gb NIC? If you really want to have fun with performance tuning you can use things like iostat and other commands to observe disk, network, and CPU. Someth
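The 110MB/s figure is straightforward arithmetic on the reported 100GB-in-15-minutes load; a quick sanity check (decimal units assumed, 1 GB = 1000 MB):

```java
public class ThroughputCheck {

    // Convert `gigabytes` GB completed in `seconds` into MB/s
    // (decimal units: 1 GB = 1000 MB).
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1000.0 / seconds;
    }

    public static void main(String[] args) {
        // 100 GB in 15 minutes, as reported elsewhere in the thread
        System.out.printf("%.1f MB/s%n", mbPerSecond(100, 15 * 60));
    }
}
```

That comes to roughly 111 MB/s, which is also about what a single 1Gb NIC can sustain, hence the question.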

Re: Data Ingestion for Large Source Files and Masking

2016-01-04 Thread obaidul karim
Hi Joe, Just completed my test with 100GB of data (on a local RAID 5 disk on a single server). I was able to load 100GB of data within 15 minutes (awesome!!) using the below flow. This throughput is enough to load 10TB of data in a day with a single, simple machine. During the test, server disk I/O went up

Re: Data Ingestion for Large Source Files and Masking

2016-01-03 Thread obaidul karim
Hi Joe, Yes, a symlink is another option I was considering when trying to use GetFile. Thanks for your insights; I will update you on this mail chain when my entire workflow is complete, so that this could be a reference for others :). -Obaid

Re: Data Ingestion for Large Source Files and Masking

2016-01-03 Thread Joe Witt
Obaid, You make a great point. I agree we will ultimately need to do more to make that very valid approach work easily. The downside is that it puts the onus on NiFi to keep track of a potentially quite large amount of state about the directory. One way to avoid that expense is if NiFi can pull a

Re: Data Ingestion for Large Source Files and Masking

2016-01-03 Thread obaidul karim
Hi Joe, Consider a scenario where we need to feed in some older files and we are using "mv" to move files into the input directory (to reduce I/O we may use "mv"). If we use "mv", the last modified date will not change, and this is very common on a busy file collection system. However, I think I can still ma

Re: Data Ingestion for Large Source Files and Masking

2016-01-03 Thread Joe Witt
Hello Obaid, The default behavior of the ListFile processor is to keep track of the last modified time of the files it lists. When you changed the name of the file, that didn't change the last modified time as tracked by the OS, but when you altered the content it did. Simply running 'touch' on the file wou
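If the feeding process can't shell out to `touch`, the same mtime bump can be done from Java. A minimal sketch (the class and method names are illustrative, not part of NiFi):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

public class Touch {

    // Bump the last-modified time to "now" so that ListFile's
    // timestamp tracking treats the file as new, even after an
    // mtime-preserving "mv" into the watched directory.
    static void touch(Path file) throws IOException {
        Files.setLastModifiedTime(file, FileTime.from(Instant.now()));
    }
}
```

This is the programmatic equivalent of `touch`: it only updates metadata, so no content I/O is incurred on a multi-gigabyte file.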

Re: Data Ingestion for Large Source Files and Masking

2016-01-03 Thread obaidul karim
Hi Joe, I am now exploring your solution, starting with the below flow: ListFile > FetchFile > CompressContent > PutFile. All seems fine, except for some confusion about how ListFile identifies new files. In order to test, I renamed an already-processed file, put it in the input folder, and found that the f

Re: Data Ingestion for Large Source Files and Masking

2016-01-01 Thread Joe Witt
Hello Obaid, At 6TB/day and an average size of 2-3GB per dataset, you're looking at a sustained rate of 70+MB/s and a pretty low transaction rate, so well within a good range to work with on a single system. 'Is there any way to bypass writing flow files on disk or directly pass those files to HD
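The 70+MB/s figure follows directly from the daily volume; a quick check (decimal units assumed, 1 TB = 1,000,000 MB, 86,400 seconds per day):

```java
public class SustainedRate {

    // Convert a daily volume in TB/day to the sustained rate in MB/s
    // (decimal units: 1 TB = 1,000,000 MB; 86,400 s per day).
    static double tbPerDayToMBps(double tbPerDay) {
        return tbPerDay * 1_000_000.0 / 86_400.0;
    }

    public static void main(String[] args) {
        // 6 TB/day, as in Obaid's scenario
        System.out.printf("%.1f MB/s%n", tbPerDayToMBps(6));
    }
}
```

Six TB/day works out to roughly 69-70 MB/s sustained, comfortably within what a single well-provisioned machine can handle.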

Data Ingestion for Large Source Files and Masking

2016-01-01 Thread obaidul karim
Hi, I am new to NiFi and exploring it as an open source ETL tool. As per my understanding, flow files are stored on the local disk and contain the actual data. If the above is true, let's consider the below scenario: Scenario 1: - In a spool directory we have terabytes (5-6TB/day) of files coming from external