Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-24 Thread Andy LoPresto
Excellent information here. Thanks Lee and Peter.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 24, 2016, at 6:57 AM, Peter Wicks (pwicks)  wrote:
> 
> Prabhu,
> 
> Lee mentioned making sure you have good indexes, but I would caution you on 
> this point.  If you have a unique constraint then SQL Server will build an 
> index on this automatically, but I would suggest dropping all other indexes 
> that aren’t related to data integrity. Each time SQL Server updates a column 
> that is indexed it’s going to be updating that index also.  This will add a 
> lot of overhead.
> 
> You might be thinking that you need these indexes though for user queries. To 
> work around this I often see the use of a staging table. This table has no 
> indexes beyond the absolute minimum to ensure data integrity, and sometimes 
> even these are removed and data integrity/duplicate removal is handled 
> through the use of SQL or a Stored Procedure.  A periodic job will move all 
> data from this staging table into a final table.  If you execute the copy and 
> a truncate in a single transaction it allows you to do this safely:
> 
> INSERT INTO "Some_Final_Table" SELECT * FROM "Staging_Table_With_Exact_Same_schema";
> TRUNCATE TABLE "Staging_Table_With_Exact_Same_schema";
> 
> If you do it this way you can keep the indexes you need for user access while 
> still allowing maximum data throughput to SQL Server.
> 
> I’ve seen a lot of comments online about batch sizes around 500 being 
> optimal, but of course this will vary with the configuration of both your 
> NiFi server and the SQL Server.
> 
> I have had issues getting good performance out of PutSQL even with the above. 
> I don’t think this is the fault of the processor, but more due to the volume 
> of data and JDBC batch row processing not really being designed for this kind 
> of volume. In my case I was trying to push about 10M rows over a longer time 
> period, but was still running into trouble. After working on the issue for a 
> while I found that a database-specific loader was needed. I am loading to 
> Teradata, so I wrote up a Teradata FastLoad processor.  In your case the MS 
> SQL Server JDBC Driver includes a `SQLServerBulkCopy` loader, 
> https://msdn.microsoft.com/en-us/library/mt221490%28v=sql.110%29.aspx. 
> Unfortunately, this would require writing code either through a scripted 
> processor, or as a whole new processor.
> 
> Since writing a custom processor may be more than you want to jump into right 
> now, you should probably take a look at `bcp`.  I didn’t catch if you were on 
> Windows or a Unix platform, but if you are on Windows I’d check out the 
> command-line Bulk Copy Program for MS SQL: 
> https://msdn.microsoft.com/en-us/library/ms162802.aspx.  Using this would 
> allow you to prepare your data in an intermediary format like CSV first, 
> then send it all at once through `bcp`.
> 
> 
> Thanks,
>   Peter Wicks
> 
> From: Lee Laim [mailto:lee.l...@gmail.com]
> Sent: Monday, October 24, 2016 7:17 AM
> To: users@nifi.apache.org
> Subject: Re: How to increase the processing speed of the ExtractText and 
> ReplaceText Processor?
> 
> Hello Prabhu,
> 
> 50 minutes is a good start! Now we have to determine where the next 
> bottleneck is - check to see where the flow files are queueing.  You can also 
> check the "average task duration" statistic for each processor.  I suspect 
> the bottleneck is at PutSQL and will carry this assumption forward.
> 
> There are several knobs you can adjust at the assumed PutSQL bottleneck:
> 1.  Increase the run duration and keep the PutSQL processor running for 2 
> seconds before releasing the thread.
> 2. Set Fragmented Transactions to false.  This removes constraints that take 
> time to check.
> 3. Consider changing the batch size systematically and observe throughput 
> changes. I'd move up in increments of 100.
> 4*. Increase the number of concurrent tasks for the bottleneck processor to 3 
> or higher.  Increase systematically to observe if you get more flow files 
> through.  You can increase the max timer driven threads of the NiFi instance 
> in the NiFi Flow Settings (top right of the canvas).  You can set the max to 
> 25, but you are truly limited by hardware here. Consider a more powerful 
> system to manage this flow, especially with the time constraint you need. It 
> is often easier to throw more hardware at the problem than to debug.
> Other areas:
> 5. On the output of the last SplitText processor,  Invoke back pressure 
> object threshold = 1.  This will slow (temporarily stop) the first split text 
> processor and reduce the number of overall flow files to manage.  It also 
> reduces the NiFi processor demand for the cpu threads.
> 6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk access.
> 7. Check connection/load on the SQL server.

Re: Can we configure NiFi to run execute process with specific Kerberos Principal?

2016-10-24 Thread Bryan Bende
Hi Ravi,

I'm not very familiar with Sqoop, but from quickly reading their
documentation and some other forums/blogs, it seems like the script that
NiFi is calling should be doing something like the following:

kinit 
/bin/sqoop.sh client

I would think that however you execute the script successfully outside of NiFi
would be the same within NiFi, meaning that NiFi is just calling a shell
script and shouldn't really need to know that Kerberos is involved.

-Bryan


On Mon, Oct 24, 2016 at 11:22 AM, Ravi Papisetti (rpapiset) <
rpapi...@cisco.com> wrote:

> Hi,
>
> We are planning to use "ExecuteProcess" to run a sqoop script wrapped by
> shell. As part of this we want NiFi to use its service principal in secure
> mode while submitting and executing the script. Otherwise the sqoop script
> fails to execute, saying "Caused by GSSException: No valid credentials provided
> (Mechanism level: Failed to find any Kerberos tgt)", because it is
> submitted by the NiFi service user (root), which doesn't have a Kerberos user
> principal.
>
> Are there any configuration options in NiFi to overcome this issue?
>
> Our use case is very similar to what is posted here: https://community.
> hortonworks.com/questions/18069/how-to-execute-commands-
> using-executeprocess-proce.html
>
>
>
> Thanks,
>
> Ravi Papisetti
>
> Technical Leader
>
> Services Technology Incubation Center
> 
>
> rpapi...@cisco.com
>
> Phone: +1 512 340 3377
>
>
>


Can we configure NiFi to run execute process with specific Kerberos Principal?

2016-10-24 Thread Ravi Papisetti (rpapiset)
Hi,

We are planning to use "ExecuteProcess" to run a sqoop script wrapped by shell. 
As part of this we want NiFi to use its service principal in secure mode while 
submitting and executing the script. Otherwise the sqoop script fails to execute, 
saying "Caused by GSSException: No valid credentials provided (Mechanism level: 
Failed to find any Kerberos tgt)", because it is submitted by the NiFi service 
user (root), which doesn't have a Kerberos user principal.

Are there any configuration options in NiFi to overcome this issue?

Our use case is very similar to what is posted here: 
https://community.hortonworks.com/questions/18069/how-to-execute-commands-using-executeprocess-proce.html



Thanks,

Ravi Papisetti

Technical Leader

Services Technology Incubation 
Center

rpapi...@cisco.com

Phone: +1 512 340 3377




RE: How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-24 Thread Peter Wicks (pwicks)
Prabhu,

Lee mentioned making sure you have good indexes, but I would caution you on 
this point.  If you have a unique constraint then SQL Server will build an 
index on this automatically, but I would suggest dropping all other indexes 
that aren’t related to data integrity. Each time SQL Server updates a column 
that is indexed it’s going to be updating that index also.  This will add a lot 
of overhead.

You might be thinking that you need these indexes though for user queries. To 
work around this I often see the use of a staging table. This table has no 
indexes beyond the absolute minimum to ensure data integrity, and sometimes 
even these are removed and data integrity/duplicate removal is handled through 
the use of SQL or a Stored Procedure.  A periodic job will move all data from 
this staging table into a final table.  If you execute the copy and a truncate 
in a single transaction it allows you to do this safely:

INSERT INTO "Some_Final_Table" SELECT * FROM "Staging_Table_With_Exact_Same_schema";
TRUNCATE TABLE "Staging_Table_With_Exact_Same_schema";

If you do it this way you can keep the indexes you need for user access while 
still allowing maximum data throughput to SQL Server.
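
For illustration, a minimal JDBC sketch of that copy-and-truncate step, assuming the 
two table names from the example above and a placeholder connection string (adjust 
for your own server and authentication):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class StagingTableFlush {
        public static void main(String[] args) throws SQLException {
            // Placeholder URL; point it at your own server/database.
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
            try (Connection conn = DriverManager.getConnection(url)) {
                conn.setAutoCommit(false); // run both statements in one transaction
                try (Statement stmt = conn.createStatement()) {
                    stmt.executeUpdate(
                        "INSERT INTO Some_Final_Table SELECT * FROM Staging_Table_With_Exact_Same_schema");
                    stmt.executeUpdate("TRUNCATE TABLE Staging_Table_With_Exact_Same_schema");
                    conn.commit(); // rows are either moved and cleared together, or not at all
                } catch (SQLException e) {
                    conn.rollback(); // staging data stays put if anything fails
                    throw e;
                }
            }
        }
    }

Because TRUNCATE TABLE is transactional in SQL Server, a failure anywhere rolls both 
statements back and no staged rows are lost.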

I’ve seen a lot of comments online about batch sizes around 500 being optimal, 
but of course this will vary with the configuration of both your NiFi server 
and the SQL Server.

I have had issues getting good performance out of PutSQL even with the above. I 
don’t think this is the fault of the processor, but more due to the volume of 
data and JDBC batch row processing not really being designed for this kind of 
volume. In my case I was trying to push about 10M rows over a longer time 
period, but was still running into trouble. After working on the issue for a 
while I found that a database-specific loader was needed. I am loading to 
Teradata, so I wrote up a Teradata FastLoad processor.  In your case the MS SQL 
Server JDBC Driver includes a `SQLServerBulkCopy` loader, 
https://msdn.microsoft.com/en-us/library/mt221490%28v=sql.110%29.aspx.  
Unfortunately, this would require writing code either through a scripted 
processor, or as a whole new processor.
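
For reference, a rough sketch of what calling that bulk loader could look like from a 
scripted or custom processor, based on the MSDN page above. The file path, table name, 
and column layout are made up here, and the exact class/method names should be checked 
against your driver version:

    import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Types;

    public class BulkCopyExample {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
            try (Connection conn = DriverManager.getConnection(url)) {
                // Describe the pipe-delimited source file (hypothetical path and columns).
                SQLServerBulkCSVFileRecord source =
                    new SQLServerBulkCSVFileRecord("/tmp/rows.dat", "UTF-8", "|", false);
                source.addColumnMetadata(1, "col1", Types.VARCHAR, 50, 0);
                source.addColumnMetadata(2, "col2", Types.VARCHAR, 50, 0);
                source.addColumnMetadata(3, "col3", Types.VARCHAR, 50, 0);
                source.addColumnMetadata(4, "col4", Types.VARCHAR, 50, 0);

                SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
                options.setBatchSize(10000); // tune against your server

                try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
                    bulkCopy.setBulkCopyOptions(options);
                    bulkCopy.setDestinationTableName("dbo.Some_Final_Table");
                    bulkCopy.writeToServer(source); // streams the whole file in bulk
                }
            }
        }
    }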

Since writing a custom processor may be more than you want to jump into right 
now, you should probably take a look at `bcp`.  I didn’t catch if you were on 
Windows or a Unix platform, but if you are on Windows I’d check out the 
command-line Bulk Copy Program for MS SQL: 
https://msdn.microsoft.com/en-us/library/ms162802.aspx.  Using this would allow 
you to prepare your data in an intermediary format like CSV first, then send 
it all at once through `bcp`.
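
If you do end up calling `bcp` from NiFi (ExecuteProcess, ExecuteStreamCommand, or a 
scripted processor), it is just an external command. A minimal Java sketch of shelling 
out to it, with made-up server, database, table, and file names and only the documented 
-S/-T/-c/-t switches:

    import java.io.IOException;

    public class BcpLoad {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical names. -c = character mode, -t| = pipe field terminator,
            // -T = trusted (Windows) authentication.
            ProcessBuilder pb = new ProcessBuilder(
                "bcp", "MyDb.dbo.Some_Final_Table", "in", "C:\\data\\rows.dat",
                "-S", "myserver", "-T", "-c", "-t|");
            pb.inheritIO(); // surface bcp's own progress/error output
            int exitCode = pb.start().waitFor();
            System.out.println("bcp exited with code " + exitCode);
        }
    }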


Thanks,
  Peter Wicks

From: Lee Laim [mailto:lee.l...@gmail.com]
Sent: Monday, October 24, 2016 7:17 AM
To: users@nifi.apache.org
Subject: Re: How to increase the processing speed of the ExtractText and 
ReplaceText Processor?

Hello Prabhu,

50 minutes is a good start! Now we have to determine where the next bottleneck 
is - check to see where the flow files are queueing.  You can also check the 
"average task duration" statistic for each processor.  I suspect the bottleneck 
is at PutSQL and will carry this assumption forward.

There are several knobs you can adjust at the assumed PutSQL bottleneck:
1.  Increase the run duration and keep the PutSQL processor running for 2 
seconds before releasing the thread.
2. Set Fragmented Transactions to false.  This removes constraints that take 
time to check.
3. Consider changing the batch size systematically and observe throughput changes. 
I'd move up in increments of 100.
4*. Increase the number of concurrent tasks for the bottleneck processor to 3 
or higher.  Increase systematically to observe if you get more flow files 
through.  You can increase the max timer driven threads of the NiFi instance 
in the NiFi Flow Settings (top right of the canvas).  You can set the max to 
to manage this flow, especially with the time constraint you need. It is often 
easier to throw more hardware at the problem than to debug.
Other areas:
5. On the output of the last SplitText processor,  Invoke back pressure object 
threshold = 1.  This will slow (temporarily stop) the first split text 
processor and reduce the number of overall flow files to manage.  It also 
reduces the NiFi processor demand for the cpu threads.
6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk access.
7. Check connection/load on the SQL server.

To address your queries, I used the same expression you provided: 
(.+)[|](.+)[|](.+)[|](.+)
You can use an ExecuteStreamCommand processor to 'extract text', but in this 
case, with small flow files, it won't offer much gain.

*With an i5 processor, you have 4 cpu threads to process flow files, manage 
NiFi, read/write to disk, and handle all other non-NiFi processes.  Moving to 
an i7 or Xeon, hyper threading will provide NiFi more resources to really get 
work done.

Re: How to increase the processing speed of the ExtractText and ReplaceText Processor?

2016-10-24 Thread Lee Laim
Hello Prabhu,

50 minutes is a good start! Now we have to *determine where the next
bottleneck is* - check to see where the flow files are queueing.  You can
also check the "average task duration" statistic for each processor.  I
suspect the bottleneck is at PutSQL and will carry this assumption
forward.

There are several knobs you can adjust at the assumed PutSQL bottleneck:

1.  Increase the run duration and keep the PutSQL processor running for *2
seconds* before releasing the thread.
2. Set *Fragmented Transactions* to false.  This removes constraints that
take time to check.
3. Consider changing the *batch size* systematically and observe throughput
changes. I'd move up in increments of 100.
4*. Increase the number of concurrent tasks for the bottleneck processor to
3 or higher.  Increase systematically to observe if you get more flow
files through.  You can increase the *max timer driven threads* of the
NiFi instance in the *NiFi Flow Settings* (top right of the canvas).  You
*can* set the max to 25, but you are truly limited by hardware here. Consider a
more powerful system to manage this flow, especially with the time
constraint you need. It is often easier to throw more hardware at the
problem than to debug.

Other areas:

5. On the output of the last SplitText processor,  Invoke *back pressure
object threshold = 1*.  This will slow (temporarily stop) the first split
text processor and reduce the number of overall flow files to manage.  It
also reduces the NiFi processor demand for the cpu threads.
6. Increase nifi.queue.swap.threshold in nifi.properties to reduce disk
access.
7. Check connection/load on the SQL server.


To address your queries, I used the same expression you provided:
(.+)[|](.+)[|](.+)[|](.+)
You *can* use an ExecuteStreamCommand processor to 'extract text', but in
this case, with small flow files, it won't offer much gain.
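
As a quick illustration of how that expression breaks a single pipe-delimited row
into four groups (the sample row below is made up), in Java:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PipeRowRegex {
        public static void main(String[] args) {
            // Same expression as above: four greedy groups separated by literal '|' characters.
            Pattern pattern = Pattern.compile("(.+)[|](.+)[|](.+)[|](.+)");
            String row = "101|john|2016-10-24|42.5"; // hypothetical single-row flow file content

            Matcher m = pattern.matcher(row);
            if (m.matches()) {
                for (int i = 1; i <= m.groupCount(); i++) {
                    System.out.println("field." + i + " = " + m.group(i));
                }
            }
        }
    }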

*With an i5 processor, you have 4 cpu threads to process flow files, manage
NiFi, read/write to disk, and handle all other non-NiFi processes.  Moving
to an i7 or Xeon, hyper threading will provide NiFi more resources to
really get work done.  While clustering is great for increasing throughput,
I wouldn't recommend clustering on a set of smaller i5 systems, as there is
added communication overhead that will need to be handled by the cpu.  It
can be done, but there are easier ways to increase throughput at this point
in development.

Hope this helps.  Also, if there is anything I stated that is contrary to
what others have observed, please chime in.

Thanks,
Lee


On Thu, Oct 20, 2016 at 6:02 PM, prabhu Mahendran 
wrote:

> Lee,
>
> I have tried your suggested flow, which is able to insert the data into
> SQL Server in 50 minutes, and it also takes a long time.
>
> ==> Your query:
> *You might be processing the entire dat file (instead of a single row) for
> each record.*
>   How can I process the entire .dat file into SQL Server?
>
>
> ==> Query: *Without any new optimizations you'll need ~25 threads and
> sufficient memory to feed the threads.*
>   My processors run with only 10 threads by setting concurrent tasks. How
> can I increase this to 25 threads?
>
> If you tried a quick test, please share the regex you used.
>
> Is there any other processor with functionality like ExtractText?
>
> Thanks
>
>
>
> On Wed, Oct 19, 2016 at 11:29 PM, Lee Laim  wrote:
>
>> Prabhu,
>>
>> In order to move 3M rows in 10 minutes, you'll need to process 5000
>> rows/second.
>> During your 4 hour run, you were processing ~200 rows/second.
>>
>> Without any new optimizations you'll need ~25 threads and sufficient
>> memory to feed the threads.  I agree with Mark and you should be able to
>> get far more than 200 rows/second.
>>
>> I ran a quick test using your ExtractText regex on similar data, and I was
>> able to process over 100,000 rows/minute through the ExtractText
>> processor.  The input data was a single row of 4 fields delimited by the
>> "|" symbol.
>>
>> *You might be processing the entire dat file (instead of a single row)
>> for each record.*
>> *Can you check the flow file attributes and content going into
>> ExtractText?  *
>>
>>
>> Here is the flow with some notes:
>>
>> 1. GetFile (a 30 MB .dat file consisting of 3M rows; each row is about 10
>> bytes)
>>
>> 2. SplitText -> SplitText  (to break the 3M rows down to manageable chunks
>> of 10,000 lines per flow file, then split again to 1 line per flow file)
>>
>> 3. ExtractText to extract the 4 fields
>>
>> 4. ReplaceText to generate json (You can alternatively use
>> AttributesToJson here)
>>
>> 5. ConvertJSONtoSQL
>>
>> 6. PutSQL - (This should be the true bottleneck; index the DB well and use
>> many threads)
>>
>>
>> If my assumptions are incorrect, please let me know.
>>
>> Thanks,
>> Lee
>>
>> On Thu, Oct 20, 2016 at 1:43 AM, Kevin Verhoeven <
>> kevin.verhoe...@ds-iq.com> wrote:
>>
>>> I’m not clear on how much data you are processing, does the data(.dat)
>>> 

start it ONCE

2016-10-24 Thread Alessio Palma
Hello All, here is a question which is driving me crazy...
I'm using the PutHiveQL processor to create some tables and views in
HIVE. This processor takes the SQL to execute from the incoming flowfile.

I'm using a GenerateFlowFile to build an empty flowfile which is later
filled using the ReplaceText processor.

It works, but the cluster has 3 hosts and I'm getting the same query 3 times.

If I use the "On primany node" scheduling stretegy into the
GenerateFlowFile, anything works but I'm no longer able to use the CRON
scheduling option.

So... is there a way to produce a SINGLE FlowFile in a cluster using a
"cron scheduling" option?