Re: Advice on orchestrating Nifi with dockerized external services

2019-04-10 Thread Koji Kawamura
Hi Eric,

Although my knowledge of MiNiFi, Python and Go is limited, I wonder if
the "nanofi" library can be used from the proprietary applications so that
they can fetch FlowFiles directly using the Site-to-Site protocol. That
could be an interesting approach and would eliminate the need to store
data on a local volume (mentioned in possible approach A).
https://github.com/apache/nifi-minifi-cpp/tree/master/nanofi

The latest MiNiFi (C++) version 0.6.0 was released recently.
https://cwiki.apache.org/confluence/display/MINIFI/Release+Notes

Thanks,
Koji

On Thu, Apr 11, 2019 at 2:28 AM Eric Chaves  wrote:
>
> Hi Folks,
>
> My company is using NiFi to perform several data-flow processes, and we have
> now received a requirement to do some fairly complex ETL over large files. To
> process those files we have some proprietary applications (mostly written in
> Python or Go) that run as Docker containers.
>
> I don't think that porting those apps as NiFi processors would produce a good
> result due to each app's complexity.
>
> Also, we would like to keep using the NiFi queues so we can monitor overall
> progress as we already do (we run several other NiFi flows), so for now we are
> discarding solutions that, for example, submit files to an external
> queue like SQS or RabbitMQ for consumption.
>
> So far we have come up with a solution that would:
>
> 1. have a Kubernetes cluster of jobs that periodically query the NiFi queue
> for new FlowFiles and pull one when a file arrives.
> 2. download the file content (which is already stored outside of NiFi) and
> process it.
> 3. submit the result back to NiFi (using an HTTP listener processor) to trigger
> subsequent NiFi processing.
>
>
> For steps 1 and 2, we are so far considering two possible approaches:
>
> A) use a MiNiFi container together with the app container in a sidecar
> design. MiNiFi would connect to our NiFi cluster and handle downloading the
> file to a local volume for the app container to process.
>
> B) use the NiFi REST API to query and consume FlowFiles from the queue
>
> One requirement is that, if needed, we can manually scale up the app cluster
> to have multiple containers consume more queued files in parallel.
>
> Do you guys recommend one over another (or a third approach)? Any pitfalls 
> you can foresee?
>
> Would be really glad to hear your thoughts on this matter.
>
> Best regards,
>
> Eric


Re: GetHbase state

2019-04-10 Thread Koji Kawamura
Hi Dwane,

I agree with you; seeing duplicated data loaded in scenario 3
(staged) is strange. It should behave the same as scenario 1, given
that Pig and NiFi were not run concurrently.

From the scenario 3 description:
> In this scenario we reloaded the same data several times and the state 
> behaved unusually after the first run.  Timestamp entries were placed in the 
> processor state but they appeared to be ignored on subsequent runs with the 
> data being reloaded several times.

I assume the test looks like:
1. Let's say there are 1000 rows total in the dataset to be loaded
2. Divide them into 10 chunks. Each has 100 rows
3. Pig inserts the first chunk to HBase, then stops
4. NiFi loads the first chunk from HBase, NiFi gets 100 rows. State is stored.
5. Stop NiFi GetHBase, run Pig to insert the 2nd chunk to HBase
6. After Pig (step 5) finishes, restart GetHBase. This time NiFi gets 100+
rows, containing duplicates that were already seen at step 4

Do I understand it correctly? Please correct me if I'm wrong.

There's one possibility for seeing duplicated rows at step 6: if Pig put
different cells, but within the same row, at step 3 and step 5.
For example, if row(1) contains cell(A) and cell(B):
Pig puts cell(A) to HBase, then GetHBase gets cell(A) on the first run.
Pig puts cell(B) to HBase, then GetHBase gets cell(A) and cell(B) on
the second run.

If that's not the case, I think we need to investigate it more.
To do so, would you be able to share more details, such as the stored
GetHBase state at step 4 and what GetHBase gets at step 6?
Executing a scan operation at step 4 using the stored timestamp as the
min timestamp, from a tool other than NiFi such as the HBase shell, would
help show what timestamp each cell has. GetHBase doesn't output each
cell's timestamp, IIRC.
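
For reference, such a check could look roughly like the following in the
HBase shell, where the table name and timestamps are placeholders only and
TIMERANGE takes [min, max) in milliseconds:

scan 'my_table', {TIMERANGE => [1554940800000, 9999999999999]}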

Thanks,
Koji

On Wed, Apr 10, 2019 at 6:25 PM Dwane Hall  wrote:
>
> Hey Koji,
>
> Thanks for the response and great question regarding the Pig job load 
> timestamp.  No, we rely entirely on the HBase timestamp value for our update 
> values.  We did pre-split our regions for greater throughput and to avoid 
> overloading a single region server.  Additionally, our key values were 
> perfectly distributed using the datafu MD5 hash algorithm to optimise 
> throughput.
>
> We suspected the same thing with the concurrent writes and distributed nature 
> of Hbase which is why we attempted scenario 3 (staged) below.  We were very 
> surprised that we reloaded data under these conditions.
>
> Thanks again for your input.
>
> Dwane
> 
> From: Koji Kawamura 
> Sent: Friday, 5 April 2019 6:08 PM
> To: users@nifi.apache.org
> Subject: Re: GetHbase state
>
> Hi Dwane,
>
> Does the Pig job put HBase data with custom timestamps? For example,
> the loading data contains a last_modified timestamp, and it's used as
> the HBase cell timestamp.
> If that's the case, GetHBase may miss some HBase rows, unless the Pig
> job loads rows ordered by the timestamp when Pig and GetHBase run
> concurrently (case 2 in your scenarios).
>
> By looking at cases 2 and 3, I imagine there may be some rows
> containing custom timestamps while others don't.
>
> If Pig doesn't specify a timestamp, the HBase Region Server sets it.
> In that case, a more recently written cell will have a more recent timestamp.
> However, due to the distributed nature of HBase, I believe there is a chance
> that a scan operation using a minimum timestamp misses some rows if rows are
> being written simultaneously.
>
>
> # 
> # Idea for future improvement
> # 
> Currently GetHBase treats the last timestamp as a single point in time.
> In order to decrease the possibility of missing rows that are being
> written simultaneously, I think GetHBase should check
> timestamps as a range.
> For example:
> - Add a new processor property "Scan Timestamp Range", such as "3 secs"
> - Remember the latest timestamp from the previous scan, and all row keys
> (and timestamps) for those having a timestamp > "latest timestamp - 3 secs"
> - When performing a scan, set min_timestamp to "latestSeenTimestamp -
> 3 secs"; this will contain duplicates
> - Emit only rows whose row keys are not contained in the
> previously scanned keys, or whose row timestamp has been updated
>
> Thanks,
> Koji
>
> On Thu, Apr 4, 2019 at 8:47 PM Dwane Hall  wrote:
> >
> > Hey fellow NiFi fans,
> >
> > I was recently loading data into Solr via HBase (around 700G 
> > ~60,000,000 db rows) using NiFi and noticed some inconsistent behaviour 
> > with the GetHbase processor and I'm wondering if anyone else has noticed 
> > similar behaviour when using it.
> >
> > Here's our environment and the high level workflow we were attempting:
> >
> > Apache NiFi 1.8
> > Two node cluster (external zookeeper maintaining processor state)
> > HBase 1.1.2
> >
> > Step 1 We execute an external Pig job to load data into a HBase table.
> > Step 2 (NiFi) We use a GetHbase processor listening to the HBase table for
> > new data - Execution context set to Primary Node only.

Re: Registry and db connection services

2019-04-10 Thread Max
OK that makes sense. But I could create a parent process group, add my
controller services there, and reference them from the child process
groups. Then I put the parent process group under version control and I
have my desired result, I think.

Is it a common/useful pattern to have multiple child process groups and
only version the parent?

Sorry for the basic questions here, I'm still trying to figure out a good
pattern for working with NiFi and deploying the result. :-)

On Wed, 10 Apr 2019 at 21:47, Bryan Bende  wrote:

> Hello,
>
> When you version control a process group, it captures all components
> in that group and child groups. If a component references a controller
> service from a parent group, then when you import this process group
> to the next environment you will have to re-link that component to the
> appropriate service in the parent group of the new environment. You
> only have to make this change one time, during the initial import. After
> that, on future upgrades, it will retain the link to the service.
>
> The reason it works this way is because NiFi can't know for sure what
> service in the new environment to select since it's a service that was
> not part of version control. There is a JIRA for a future improvement
> to attempt to auto-select the service from a parent group when there
> is only one service of the required type with the same name (name is
> not unique so if there is more than one with the same name we can't
> know which one to select).
>
> -Bryan
>
> On Wed, Apr 10, 2019 at 5:17 AM Max  wrote:
> >
> > Hello!
> >
> > I’d like to use registry for deploying my flows in various environments.
> >
> > Let’s say on dev I create a flow that has an ExecuteSQL processor and
> this processor references a db controller service, which is defined on the
> root level.
> >
> > I then version my flow and commit. Next, I go to my production env and
> import the flow. The result is that the ExecuteSQL processor on prod
> will not be enabled because the property “Database Connection Pooling
> Service” will say “Incompatible Controller Service configured” where on dev
> I had the db referenced.
> >
> > Maybe I don’t get versioning in NiFi yet but how is that supposed to
> work to reference controller services (db connections in particular) that
> are defined on the root level when using registry?
>


Advice on orchestrating Nifi with dockerized external services

2019-04-10 Thread Eric Chaves
Hi Folks,

My company is using NiFi to perform several data-flow processes, and we have
now received a requirement to do some fairly complex ETL over large files. To
process those files we have some proprietary applications (mostly written
in Python or Go) that run as Docker containers.

I don't think that porting those apps as NiFi processors would produce a
good result due to each app's complexity.

Also, we would like to keep using the NiFi queues so we can monitor overall
progress as we already do (we run several other NiFi flows), so for now we
are discarding solutions that, for example, submit files to an external
queue like SQS or RabbitMQ for consumption.

So far we have come up with a solution that would:

   1. have a Kubernetes cluster of jobs that periodically query the NiFi
   queue for new FlowFiles and pull one when a file arrives.
   2. download the file content (which is already stored outside of NiFi)
   and process it.
   3. submit the result back to NiFi (using an HTTP listener processor) to
   trigger subsequent NiFi processing (a rough sketch follows below this list).
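
As a rough illustration of step 3, assuming the "HTTP listener" here is
NiFi's ListenHTTP processor, the containerized app could post its result back
along these lines (the URL, port, and header name are placeholders, and
ListenHTTP would need to be configured to receive the header as an attribute):

import requests

def submit_result(result_path, source_flowfile_uuid):
    # Stream the processed file back to the NiFi ListenHTTP endpoint.
    with open(result_path, "rb") as f:
        resp = requests.post(
            "http://nifi.example.internal:9999/contentListener",  # placeholder URL
            data=f,
            # Correlate the result with the original FlowFile; ListenHTTP must be
            # configured to pick this header up as an attribute.
            headers={"source-flowfile-uuid": source_flowfile_uuid},
        )
    resp.raise_for_status()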


For steps 1 and 2, we are so far considering two possible approaches:

A) use a MiNiFi container together with the app container in a sidecar
design. MiNiFi would connect to our NiFi cluster and handle downloading the
file to a local volume for the app container to process.

B) use the NiFi REST API to query and consume FlowFiles from the queue (a
rough sketch follows below)
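
As a hedged sketch of approach B, the querying part could be built on the REST
endpoints that back the UI's "List queue" and "View content" features, roughly
as below (base URL and connection id are placeholders, authentication is
omitted). Note that listing a queue does not remove FlowFiles from it, so the
"consume" part would still need its own design:

import time
import requests

NIFI = "http://nifi.example.internal:8080/nifi-api"   # placeholder base URL
CONNECTION_ID = "replace-with-the-queued-connection-id"

def list_queued_flowfiles():
    # Ask NiFi to build a listing of the queue, then poll until it is finished.
    req = requests.post(f"{NIFI}/flowfile-queues/{CONNECTION_ID}/listing-requests").json()
    request_id = req["listingRequest"]["id"]
    try:
        while True:
            listing = requests.get(
                f"{NIFI}/flowfile-queues/{CONNECTION_ID}/listing-requests/{request_id}"
            ).json()["listingRequest"]
            if listing.get("finished"):
                return listing.get("flowFileSummaries", [])
            time.sleep(1)
    finally:
        # Clean up the listing request on the NiFi side.
        requests.delete(f"{NIFI}/flowfile-queues/{CONNECTION_ID}/listing-requests/{request_id}")

def download_content(flowfile_summary, target_path):
    # Fetch the content of one listed FlowFile; clusterNodeId is needed on a cluster.
    params = {}
    if flowfile_summary.get("clusterNodeId"):
        params["clusterNodeId"] = flowfile_summary["clusterNodeId"]
    resp = requests.get(
        f"{NIFI}/flowfile-queues/{CONNECTION_ID}/flowfiles/{flowfile_summary['uuid']}/content",
        params=params,
    )
    resp.raise_for_status()
    with open(target_path, "wb") as f:
        f.write(resp.content)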

One requirement is that, if needed, we can manually scale up the app
cluster to have multiple containers consume more queued files in parallel.

Do you guys recommend one over another (or a third approach)? Any pitfalls
you can foresee?

Would be really glad to hear your thoughts on this matter.

Best regards,

Eric


[ANNOUNCE] Apache NiFi 1.9.2 release.

2019-04-10 Thread Joe Witt
Hello

The Apache NiFi team would like to announce the release of Apache NiFi
1.9.2.

Apache NiFi is an easy to use, powerful, and reliable system to process and
distribute
data.  Apache NiFi was made for dataflow.  It supports highly configurable
directed graphs
of data routing, transformation, and system mediation logic.

More details on Apache NiFi can be found here:
https://nifi.apache.org/

The release artifacts can be downloaded from here:
https://nifi.apache.org/download.html

Maven artifacts have been made available.

Issues closed/resolved for this list can be found here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12316020&version=12345213

Release note highlights can be found here:
https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version1.9.2

Thank you
The Apache NiFi team


RE: HiveQL to load data in hive after generating file on HDFS

2019-04-10 Thread DEHAY Aurelien
Hello.

Thanks (and thanks Matt).

I saw the answer had already been posted on the forums AFTER my mail; it's
nice, however, to have the explanation about the memory and the repositories.

Thanks.

From: Bryan Bende 
Sent: mardi 9 avril 2019 16:52
To: users@nifi.apache.org
Subject: Re: HiveQL to load data in hive after generating file on HDFS

I meant to say that in ReplaceText you would need to use "Always Replace" with 
a mode of "Entire Text".

On Tue, Apr 9, 2019 at 10:38 AM Matt Burgess <mattyb...@apache.org> wrote:
Since you know you just want to overwrite the contents with HiveQL, you could 
use ExecuteScript with the following Groovy script:

import org.apache.nifi.processor.io.OutputStreamCallback

def flowFile = session.get()
if(!flowFile) return
try {
   // Path to the file written by the upstream PutParquet/PutHDFS step
   def path = flowFile.getAttribute('absolute.hdfs.path')
   // Overwrite the FlowFile content with the HiveQL LOAD DATA statement
   flowFile = session.write(flowFile, {outStream ->
  outStream.write("LOAD DATA INPATH '$path' INTO TABLE myTable".bytes)
   } as OutputStreamCallback)
   session.transfer(flowFile, REL_SUCCESS)
} catch(e) {
   log.error('Error generating HiveQL', e)
   session.transfer(flowFile, REL_FAILURE)
}

Or better yet, you can paste that into InvokeScriptedProcessor using a template 
[1] I created to allow users to port their ExecuteScript body to the faster 
InvokeScriptedProcessor. It might be nice to have a SetFlowFileContent 
processor for this use case, if ReplaceText does load the incoming content into 
memory.

Regards,
Matt

[1] 
https://funnifi.blogspot.com/2017/06/invokescriptedprocessor-template-faster.html



On Tue, Apr 9, 2019 at 9:49 AM DEHAY Aurelien <aurelien.de...@faurecia.com> wrote:
Hello.

We would like, in a flow, to:

1)  Read data from Kafka

2)  Merge records

3)  PutParquet on HDFS

4)  Load the data into Hive using a LOAD DATA HiveQL statement.

The first 3 steps are no problem; they're working fine.

But we wonder what the best way is to launch the HiveQL command. We planned to 
use the PutHiveQL processor, but it needs the command in the FlowFile content.

Using GenerateFlowFile would be nice, but we can't generate the event to 
trigger it, which also prevents us from using Wait/Notify.

What is the best way? I suppose ReplaceText or similar would be too “heavy”, 
as it requires loading the message into memory?

Thanks for any pointer/idea.


Aurélien DEHAY
Big Data Architect
+33 616 815 441
aurelien.de...@faurecia.com

23/27 avenue des Champs Pierreux
92735 Nanterre Cedex – France


This electronic transmission (and any attachments thereto) is intended solely 
for the use of the addressee(s). It may contain confidential or legally 
privileged information. If you are not the intended recipient of this message, 
you must delete it immediately and notify the sender. Any unauthorized use or 
disclosure of this message is strictly prohibited.  Faurecia does not guarantee 
the integrity of this transmission and shall therefore never be liable if the 
message is altered or falsified nor for any virus, interception or damage to 
your system.


Re: Registry and db connection services

2019-04-10 Thread Bryan Bende
Hello,

When you version control a process group, it captures all components
in that group and child groups. If a component references a controller
service from a parent group, then when you import this process group
to the next environment you will have to re-link that component to the
appropriate service in the parent group of the new environment. You
only have to make this change one time, during the initial import. After
that, on future upgrades, it will retain the link to the service.
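
For anyone who would rather script that one-time re-link than do it in the UI,
a rough sketch against the NiFi REST API could look like the following (base
URL and ids are placeholders, authentication is omitted, and the property name
matches the ExecuteSQL "Database Connection Pooling Service" property
discussed in this thread):

import requests

NIFI = "http://nifi-prod.example.internal:8080/nifi-api"   # placeholder base URL
PROCESSOR_ID = "replace-with-the-ExecuteSQL-processor-id"
DBCP_SERVICE_ID = "replace-with-the-prod-controller-service-id"

def relink_dbcp_service():
    # Fetch the processor first to obtain its current revision (required for updates).
    entity = requests.get(f"{NIFI}/processors/{PROCESSOR_ID}").json()
    update = {
        "revision": entity["revision"],
        "component": {
            "id": PROCESSOR_ID,
            "config": {
                # Point the property at the service defined in the parent group
                # of the new environment.
                "properties": {"Database Connection Pooling Service": DBCP_SERVICE_ID},
            },
        },
    }
    resp = requests.put(f"{NIFI}/processors/{PROCESSOR_ID}", json=update)
    resp.raise_for_status()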

The reason it works this way is because NiFi can't know for sure what
service in the new environment to select since it's a service that was
not part of version control. There is a JIRA for a future improvement
to attempt to auto-select the service from a parent group when there
is only one service of the required type with the same name (name is
not unique so if there is more than one with the same name we can't
know which one to select).

-Bryan

On Wed, Apr 10, 2019 at 5:17 AM Max  wrote:
>
> Hello!
>
> I’d like to use registry for deploying my flows in various environments.
>
> Let’s say on dev I create a flow that has an ExecuteSQL processor and this 
> processor references a db controller service, which is defined on the root 
> level.
>
> I then version my flow and commit. Next, I go to my production env and import 
> the flow. The result is that the ExecuteSQL processor on prod
> will not be enabled because the property “Database Connection Pooling 
> Service” will say “Incompatible Controller Service configured” where on dev I 
> had the db referenced.
>
> Maybe I don’t get versioning in NiFi yet but how is that supposed to work to 
> reference controller services (db connections in particular) that are defined 
> on the root level when using registry?


Re: GetHbase state

2019-04-10 Thread Dwane Hall
Hey Koji,

Thanks for the response and great question regarding the Pig job load 
timestamp.  No, we rely entirely on the HBase timestamp value for our update 
values.  We did pre-split our regions for greater throughput and to avoid 
overloading a single region server.  Additionally, our key values were 
perfectly distributed using the datafu MD5 hash algorithm to optimise 
throughput.

We suspected the same thing with the concurrent writes and distributed nature 
of Hbase which is why we attempted scenario 3 (staged) below.  We were very 
surprised that we reloaded data under these conditions.

Thanks again for your input.

Dwane

From: Koji Kawamura 
Sent: Friday, 5 April 2019 6:08 PM
To: users@nifi.apache.org
Subject: Re: GetHbase state

Hi Dwane,

Does the Pig job put HBase data with custom timestamps? For example,
the loading data contains a last_modified timestamp, and it's used as
the HBase cell timestamp.
If that's the case, GetHBase may miss some HBase rows, unless the Pig
job loads rows ordered by the timestamp when Pig and GetHBase run
concurrently (case 2 in your scenarios).

By looking at cases 2 and 3, I imagine there may be some rows
containing custom timestamps while others don't.

If Pig doesn't specify a timestamp, the HBase Region Server sets it.
In that case, a more recently written cell will have a more recent timestamp.
However, due to the distributed nature of HBase, I believe there is a chance
that a scan operation using a minimum timestamp misses some rows if rows are
being written simultaneously.


# 
# Idea for future improvement
# 
Currently GetHBase treats the last timestamp as a single point in time.
In order to decrease the possibility of missing rows that are being
written simultaneously, I think GetHBase should check
timestamps as a range.
For example (a rough sketch in Python follows after this list):
- Add a new processor property "Scan Timestamp Range", such as "3 secs"
- Remember the latest timestamp from the previous scan, and all row keys
(and timestamps) for those having a timestamp > "latest timestamp - 3 secs"
- When performing a scan, set min_timestamp to "latestSeenTimestamp -
3 secs"; this will contain duplicates
- Emit only rows whose row keys are not contained in the
previously scanned keys, or whose row timestamp has been updated
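
To make that concrete, here is a rough Python sketch of the idea. This is not
NiFi or GetHBase code; the class, the scan() callable (assumed to yield
(row_key, timestamp) pairs for cells newer than min_timestamp), and the
3-second window are illustrative assumptions only:

SCAN_RANGE_MS = 3000  # the "Scan Timestamp Range" property, e.g. 3 secs

class RangeDedupState:
    """State kept between polls: the latest timestamp seen, plus the row keys
    (and timestamps) seen within the trailing range window."""

    def __init__(self):
        self.latest_seen_ts = 0
        self.recently_seen = {}  # row key -> timestamp

    def poll(self, scan):
        # Re-scan from (latest seen - range); this deliberately overlaps the
        # previous scan, so the raw results may contain duplicates.
        min_ts = max(self.latest_seen_ts - SCAN_RANGE_MS, 0)
        emitted = []
        for row_key, row_ts in scan(min_timestamp=min_ts):
            seen_ts = self.recently_seen.get(row_key)
            # Emit only rows not seen before, or rows whose timestamp changed.
            if seen_ts is None or row_ts > seen_ts:
                emitted.append((row_key, row_ts))
            self.recently_seen[row_key] = max(row_ts, seen_ts or 0)
            self.latest_seen_ts = max(self.latest_seen_ts, row_ts)
        # Drop keys that have fallen out of the trailing window.
        window_start = self.latest_seen_ts - SCAN_RANGE_MS
        self.recently_seen = {k: ts for k, ts in self.recently_seen.items()
                              if ts >= window_start}
        return emitted

Each poll would then return only new or updated rows, at the cost of keeping a
small set of recently seen row keys in processor state.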

Thanks,
Koji

On Thu, Apr 4, 2019 at 8:47 PM Dwane Hall  wrote:
>
> Hey fellow NiFi fans,
>
> I was recently loading data into Solr via HBase (around 700G ~60,000,000 
> db rows) using NiFi and noticed some inconsistent behaviour with the GetHbase 
> processor and I'm wondering if anyone else has noticed similar behaviour when 
> using it.
>
> Here's our environment and the high level workflow we were attempting:
>
> Apache NiFi 1.8
> Two node cluster (external zookeeper maintaining processor state)
> HBase 1.1.2
>
> Step 1 We execute an external Pig job to load data into a HBase table.
> Step 2 (NiFi) We use a GetHbase processor listening to the HBase table for 
> new data - Execution context set to Primary Node only.
> Step 3 (NiFi) Some light attribute addition and eventually the data is stored 
> in Solr using PutSolrContentStream.
>
> What we found during our testing is that the GetHBase processor did not 
> appear to accurately maintain its state as data was being loaded out of 
> Hbase.  We tried a number of load strategies with varying success.
>
> 1. No concurrency - Wait for all data to be loaded into HBase by the Pig job 
> and then dump all 700G of data into NiFi. This was successful as there was no 
> state dependency, but we lose the ability to run the Solr load and Pig job in 
> parallel and we dump a lot of data on NiFi at once.
> 2. Concurrent - Pig job and GetHbase processor running concurrently. Here we 
> found we missed ~30% of the data we were loading into Solr.
> 3. Staged - Load a portion of data into HBase using Pig, start the GetHbase 
> processor to load that portion of data, and repeat until all data is loaded 
> (the Pig job and GetHbase processor were never run concurrently).  In this 
> scenario we reloaded the same data several times and the state behaved 
> unusually after the first run.  Timestamp entries were placed in the 
> processor state but they appeared to be ignored on subsequent runs with the 
> data being reloaded several times.
>
>
> We were able to work around the issue using the ScanHbase processor and 
> specifying our key range to load the data so it was not a showstopper for us. 
>  I'm just wondering if any other users in the community have had similar 
> experiences using this processor or if I need to revise my local environment 
> configuration?
>
> Thanks,
>
> Dwane


Registry and db connection services

2019-04-10 Thread Max
Hello!

I’d like to use registry for deploying my flows in various environments.

Let’s say on dev I create a flow that has an ExecuteSQL processor and this
processor references a db controller service, which is defined on the root
level.

I then version my flow and commit. Next, I go to my production env and
import the flow. The result is that the ExecuteSQL processor on prod
will not be enabled because the property “Database Connection Pooling
Service” will say “Incompatible Controller Service configured” where on dev
I had the db referenced.

Maybe I don’t get versioning in NiFi yet but how is that supposed to work
to reference controller services (db connections in particular) that are
defined on the root level when using registry?