Re: SELinux and NiFi

2024-03-08 Thread Mike Thomsen
Similar situation here. Not a devops guy. So I guess it’s fair to say this is
expected behavior if policies are not written to allow nifi.sh and such to
access the folders NiFi needs to access. Thanks for the help.

Sent from my iPhone

> On Mar 8, 2024, at 5:38 PM, Russell Bateman  wrote:
>
> That's what lies in those "SELinux policies."
>
> I think it's simple: Use SELinux to lock the filesystem (and other stuff)
> up so no one can get in or go around. Then create specific policies that
> allow, in this case, NiFi, access to its filesystem (like
> /opt/nifi/current-nifi/) so that it can do work. Obviously, when you
> install NiFi, things can get complicated like where do you want each
> repository to live--you'll have to provide NiFi access to each place, no
> longer a single filesystem.
>
> This is handled by DevOps guys and not me (I just write custom
> processors), but if you get real pointed, I can ask them better questions
> they can answer.
>
> Russ
>
>> On 3/8/24 15:04, Mike Thomsen wrote:
>>
>> I think the admin told me that even a simple nifi.sh start won’t work. Just
>> won’t even start the script and it is marked executable. I was wondering if
>> there were any gotchas to getting a basic setup running.
>>
>> Sent from my iPhone
>>
>>> On Mar 8, 2024, at 4:29 PM, Russell Bateman  wrote:
>>>
>>> We have run on CentOS with SELinux set to enforcing and have run NiFi in
>>> that environment for probably 8 or 9 years now. We do install some SELinux
>>> policies that allow NiFi to access the filesystem underneath itself and
>>> not outside that filesystem.
>>>
>>> What specifically are you asking?
>>>
>>>> On 3/8/24 14:04, Mike Thomsen wrote:
>>>> Does anyone have experience setting up NiFi w/ SELinux set to "enforcing?"


Re: SELinux and NiFi

2024-03-08 Thread Mike Thomsen
I think the admin told me that even a simple nifi.sh start won’t work. Just 
won’t even start the script and it is marked executable. I was wondering if 
there were any gotchas to getting a basic setup running.


Sent from my iPhone

> On Mar 8, 2024, at 4:29 PM, Russell Bateman  wrote:
> 
>  We have run on CentOS with SELinux set to enforcing and have run NiFi in 
> that environment for probably 8 or 9 years now. We do install some SELinux 
> policies that allow NiFi to access the filesystem underneath itself and not 
> outside that filesystem.
> 
> What specifically are you asking?
> 
>> On 3/8/24 14:04, Mike Thomsen wrote:
>> Does anyone have experience setting up NiFi w/ SELinux set to "enforcing?"
> 


SELinux and NiFi

2024-03-08 Thread Mike Thomsen
Just did a search through the docs on Google and nothing came up for
SELinux.

Does anyone have experience setting up NiFi w/ SELinux set to "enforcing?"

Thanks,

Mike
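
For anyone who lands on this thread later, a minimal troubleshooting sketch on
an SELinux-enforcing host. The paths, the usr_t type, and the nifi_local module
name are examples rather than anything from this thread; your security team may
define a dedicated type for NiFi instead.

    # confirm the failures really are SELinux denials
    sudo ausearch -m avc -ts recent | grep -i nifi

    # label the NiFi install tree (example type), then rebuild the labels
    sudo semanage fcontext -a -t usr_t "/opt/nifi(/.*)?"
    sudo restorecon -Rv /opt/nifi

    # turn any remaining denials into a local policy module and install it
    sudo ausearch -m avc -ts recent --raw | audit2allow -M nifi_local
    sudo semodule -i nifi_local.pp

If the repositories live outside the install directory, each of those paths
needs the same treatment, as noted in the replies above.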


Re: New Apache NiFi Website Design Launched

2024-01-09 Thread Mike Thomsen
That's a very impressive update!

On Mon, Jan 8, 2024 at 2:38 PM Jörg Hammerbacher 
wrote:

> Looks Great! Well Done!
>
>
> Jörg
>
> Am 08.01.2024 um 19:17 schrieb David Handermann:
> > Team,
> >
> > Thanks to a collaborative effort from several designers and
> > developers, the Apache NiFi project website has a new look, with more
> > prominent links to downloads, documentation, and source code!
> >
> > https://nifi.apache.org
> >
> > There is more work to be done in particular areas like generated
> > documentation, but as with the project itself, the website is open for
> > collaborative input through Jira [1] and GitHub [2].
> >
> > Regards,
> > David Handermann
> > Apache NiFi PMC Member
> >
> > [1] https://issues.apache.org/jira/browse/NIFI
> > [2] https://github.com/apache/nifi-site
>
>


Re: Apache Tika compatible with NiFi 1.16.3

2023-11-22 Thread Mike Thomsen
If you need to use common Tika functionality, just grab the NiFi media NAR
from the main Maven repo. It has common features like identifying mime
types and extracting text content.

On Sun, Nov 12, 2023 at 6:42 AM James McMahon  wrote:

> Where can I look to determine which Apache Tika version should be
> downloaded to add to my lib directory for my Apache NiFi 1.16.3
> installation?
>


Re: Parameter context syncing inconsistency with base context

2023-11-02 Thread Mike Thomsen
I'll see if I can do that, but here's the workflow I went through. We're on
1.21.0 at least for NiFi. Still verifying what version of the registry
we're on, but I think it's recent.

1. Create a PG, provide a parameter context on cluster a.
2. Create a base parameter context, assign that to the context from #1 as
its parent.
3. Deploy the PG via Registry import on cluster b
4. Add multiple keys (sensitive or not) to the base context on cluster a
5. Reference keys from #4 in a component in the flow.
6. Save changes on the PG on cluster a
7. Attempt to upgrade the version on cluster b

Note: I brought the flow directly into a third cluster and it imported
correctly on the first pass.

On Thu, Nov 2, 2023 at 10:06 AM Bryan Bende  wrote:

> Mike,
>
> Can you come up with a small reproducible example to create the problem?
>
> Also, is it the latest version of NiFi Registry?
>
>
>
> On Wed, Nov 1, 2023 at 3:14 PM Mike Thomsen 
> wrote:
>
>> I refactored a bunch of related flows to use a base parameter context. I
>> added a few changes on my development cluster to a flow including two new
>> properties added to the base parameter context. Saved those changes into
>> our NiFi Registry, and then upgraded the test cluster to use that new
>> version of the flow. I noticed that none of the new properties were
>> migrated over. Any thoughts on how to fix this besides just manually adding
>> the changes? This feels like a bug in the parameter change management.
>>
>> Thanks,
>>
>> Mike
>>
>


Parameter context syncing inconsistency with base context

2023-11-01 Thread Mike Thomsen
I refactored a bunch of related flows to use a base parameter context. I
added a few changes on my development cluster to a flow including two new
properties added to the base parameter context. Saved those changes into
our NiFi Registry, and then upgraded the test cluster to use that new
version of the flow. I noticed that none of the new properties were
migrated over. Any thoughts on how to fix this besides just manually adding
the changes? This feels like a bug in the parameter change management.

Thanks,

Mike


Re: NiFi hanging during large sql query

2023-09-13 Thread Mike Thomsen
The resolution was to manually paginate the query statement with the
appropriate postgres syntax and set the fetchSize to 250 instead of 0,
which is unlimited.

On Mon, Sep 11, 2023 at 8:54 AM  wrote:

>
> Hello Mike,
>
> Could you please give me the details about the resolution?
> Did you change something in the processor, or just change the sql
> command?
>
> Regards
>
> *Envoyé:* samedi 2 septembre 2023 à 00:00
> *De:* "Mike Thomsen" 
> *À:* users@nifi.apache.org
> *Objet:* NiFi hanging during large sql query
> I have a three node cluster with an executesqlrecord processor with
> primary execution only. The sql it runs is a straight forward select on a
> table with about 44m records. If I leave it running, after about 10 min the
> node becomes unresponsive and leaves the cluster. The query runs just fine
> in jetbrains data grip on that postgresql server, so I don’t think it’s
> anything weird with the db or query. Any ideas about what could be causing
> this? Even with a high limit like 5m records the query doesn’t lock up the
> NiFi node.
>
> Sent from my iPhone
>
>
>
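
For reference, a sketch of the pagination described above, with hypothetical
table and column names (not from the thread). The idea is to set the
processor's Fetch Size to something small such as 250 and have the query walk
the table in bounded pages rather than one 44M-row result:

    -- keyset pagination: ${last_seen_id} is a placeholder for the highest id
    -- returned by the previous page
    SELECT id, payload
    FROM   big_table
    WHERE  id > ${last_seen_id}
    ORDER  BY id
    LIMIT  250;

The PostgreSQL JDBC driver also only streams results with a cursor when
autoCommit is off; otherwise it buffers the whole result set in memory, which
matches the behavior described in this thread.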


Re: NiFi hanging during large sql query

2023-09-04 Thread Mike Thomsen
I don't think so.

On Sat, Sep 2, 2023 at 11:36 AM Mark Payne  wrote:

> Thanks for sharing the solution Mike. Is there something we need to update
> in nifi to prevent this from biting others?
>
> Thanks
> Mark
>
> Sent from my iPhone
>
> On Sep 2, 2023, at 9:48 AM, Joe Witt  wrote:
>
> 
> Nice.  Glad you found it.
>
> On Sat, Sep 2, 2023 at 5:07 AM Mike Thomsen 
> wrote:
>
>> It was the PostgreSQL JDBC driver. If you don't paginate the query
>> aggressively, it will try to load a significant chunk of the table into
>> memory rather than just pulling chunks, even with fetchSize set low.
>>
>> On Fri, Sep 1, 2023 at 6:01 PM Mike Thomsen 
>> wrote:
>>
>>> I have a three node cluster with an executesqlrecord processor with
>>> primary execution only. The sql it runs is a straight forward select on a
>>> table with about 44m records. If I leave it running, after about 10 min the
>>> node becomes unresponsive and leaves the cluster. The query runs just fine
>>> in jetbrains data grip on that postgresql server, so I don’t think it’s
>>> anything weird with the db or query. Any ideas about what could be causing
>>> this? Even with a high limit like 5m records the query doesn’t lock up the
>>> NiFi node.
>>>
>>> Sent from my iPhone
>>
>>


Re: NiFi hanging during large sql query

2023-09-02 Thread Mike Thomsen
It was the PostgreSQL JDBC driver. If you don't paginate the query
aggressively, it will try to load a significant chunk of the table into
memory rather than just pulling chunks, even with fetchSize set low.

On Fri, Sep 1, 2023 at 6:01 PM Mike Thomsen  wrote:

> I have a three node cluster with an executesqlrecord processor with
> primary execution only. The sql it runs is a straight forward select on a
> table with about 44m records. If I leave it running, after about 10 min the
> node becomes unresponsive and leaves the cluster. The query runs just fine
> in jetbrains data grip on that postgresql server, so I don’t think it’s
> anything weird with the db or query. Any ideas about what could be causing
> this? Even with a high limit like 5m records the query doesn’t lock up the
> NiFi node.
>
> Sent from my iPhone


NiFi hanging during large sql query

2023-09-01 Thread Mike Thomsen
I have a three node cluster with an executesqlrecord processor with primary 
execution only. The sql it runs is a straight forward select on a table with 
about 44m records. If I leave it running, after about 10 min the node becomes 
unresponsive and leaves the cluster. The query runs just fine in jetbrains data 
grip on that postgresql server, so I don’t think it’s anything weird with the 
db or query. Any ideas about what could be causing this? Even with a high limit 
like 5m records the query doesn’t lock up the NiFi node.

Sent from my iPhone

Re: TLSv1.3 SSLContext not available on Java 11 and RHEL8

2023-08-15 Thread Mike Thomsen
I had similar thoughts and told them to start working with different
flavors of Java 11.

Thanks,

Mike

On Tue, Aug 15, 2023 at 10:03 AM David Handermann <
exceptionfact...@apache.org> wrote:

> Mike,
>
> It sounds like the problem could be related to the specific Java vendor
> and version, or related to Java Security settings.
>
> Java 8 Update 261 [1] and following include TLSv1.3, and Java 11 also
> includes TLSv1.3 as you noted. However, the java.security configuration can
> disable specific TLS versions using the jdk.tls.disabledAlgorithms property.
>
> It is possible that a custom java.security configuration disabled TLSv1.3,
> perhaps for compatibility reasons. Checking the java.security configuration
> for the JDK installation would be a good next step for troubleshooting.
>
> Regards,
> David Handermann
>
> [1] https://www.oracle.com/java/technologies/javase/8u261-relnotes.html
>
> [2]
> https://docs.oracle.com/en/java/javase/11/security/java-secure-socket-extension-jsse-reference-guide.html#GUID-0A438179-32A7-4900-A81C-29E3073E1E90
>
> On Tue, Aug 15, 2023 at 8:43 AM Mike Thomsen 
> wrote:
>
>> Roughly copy-pasta: "ERROR o.anifi.security.util.SslContextFactory
>> Encountered an error creating SSLContext from TLSConfiguration
>> [TlsConfiguration]keystorePath.protocol=TLSv1.3): TLSv1.3 SSLContext
>> not available"
>>
>> Can't copy and paste because it's on a client's network.
>>
>> On Tue, Aug 15, 2023 at 9:41 AM Phillip Lord 
>> wrote:
>>
>>> Can you add the error here for more context?
>>> On Aug 15, 2023 at 9:38 AM -0400, Mike Thomsen ,
>>> wrote:
>>>
>>> As the subject line says, we're getting a weird error when trying to
>>> migrate to RHEL8. We're already on Java 11 on RHEL7, but for some reason
>>> NiFi is running into problems instantiating a TLSv1.3 SSLContext.
>>>
>>> Does anyone have any suggestions on what could be happening here?
>>>
>>>


Re: TLSv1.3 SSLContext not available on Java 11 and RHEL8

2023-08-15 Thread Mike Thomsen
Roughly copy-pasta: "ERROR o.anifi.security.util.SslContextFactory
Encountered an error creating SSLContext from TLSConfiguration
[TlsConfiguration]keystorePath.protocol=TLSv1.3): TLSv1.3 SSLContext
not available"

Can't copy and paste because it's on a client's network.

On Tue, Aug 15, 2023 at 9:41 AM Phillip Lord  wrote:

> Can you add the error here for more context?
> On Aug 15, 2023 at 9:38 AM -0400, Mike Thomsen ,
> wrote:
>
> As the subject line says, we're getting a weird error when trying to
> migrate to RHEL8. We're already on Java 11 on RHEL7, but for some reason
> NiFi is running into problems instantiating a TLSv1.3 SSLContext.
>
> Does anyone have any suggestions on what could be happening here?
>
>


TLSv1.3 SSLContext not available on Java 11 and RHEL8

2023-08-15 Thread Mike Thomsen
As the subject line says, we're getting a weird error when trying to
migrate to RHEL8. We're already on Java 11 on RHEL7, but for some reason
NiFi is running into problems instantiating a TLSv1.3 SSLContext.

Does anyone have any suggestions on what could be happening here?
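
A quick way to check the java.security suggestion from the replies above on
the RHEL8 host, assuming a stock OpenJDK 11 layout (paths and commands are
examples):

    # see whether TLSv1.3 appears in the disabled algorithms for this JDK
    grep -n "jdk.tls.disabledAlgorithms" "$JAVA_HOME/conf/security/java.security"

    # RHEL8 system-wide crypto policies can also influence the JDK defaults
    update-crypto-policies --show

    # confirm the JVM that runs NiFi can actually build a TLSv1.3 context
    echo 'System.out.println(javax.net.ssl.SSLContext.getInstance("TLSv1.3").getProtocol())' | jshell -q -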


Re: NiFi not rolling logs

2023-07-09 Thread Mike Thomsen
The totalSizeCap feature will help a lot from what I’ve seen.

Sent from my iPhone

> On Jul 8, 2023, at 9:26 AM, Lars Winderling  wrote:
>
> Hi Mike,
>
> thanks for the advice. Our NiFi instances are running for weeks, if not
> months, often times until the next update, so the startup option won't bring
> much benefit, I fear, or am I mistaken? But looking forward to 1.23!
>
> On 8 July 2023 13:40:15 CEST, Mike Thomsen  wrote:
>
>> Lars,
>>
>> You should also experiment with cleanHistoryOnStart. I did some
>> experimentation this morning where I set the maxHistory to 1 (1 day vs the
>> default of 30 which is 30 days), created a few fake log files from previous
>> days and NiFi immediately cleared out those "old files" on startup. I have
>> a Jira ticket up to fix this for 1.x and 2.x and will likely have it up
>> today. Should definitely be ready for 1.23.


Re: NiFi not rolling logs

2023-07-08 Thread Mike Thomsen
Lars,

You should also experiment with cleanHistoryOnStart. I did some
experimentation this morning where I set the maxHistory to 1 (1 day vs the
default of 30 which is 30 days), created a few fake log files from previous
days and NiFi immediately cleared out those "old files" on startup. I have
a Jira ticket up to fix this for 1.x and 2.x and will likely have it up
today. Should definitely be ready for 1.23

On Sat, Jul 8, 2023 at 4:17 AM Lars Winderling 
wrote:

> Dear NiFiers, we have been bugged so much by overflowing logfiles, and
> nothing has ever helped. I thought it was just my lack of
> skills...especially when NiFi has some issues and keeps on spilling
> stacktraces with high frequency to disk, it eats up space quickly. I have
> created cronjobs that rotate logs every minute iff required, and when
> almost no space is left, it simply deletes old files. Will try totalCapSize
> etc. Thank you for the pointers! Best, Lars
>
>
> On 8 July 2023 09:33:41 CEST, "Jens M. Kofoed" 
> wrote:
>
>> Hi
>>
>> Please have a look at this old jira:
>> https://issues.apache.org/jira/browse/NIFI-2203
>> I have had issues where a processor creates a log message every 10ms,
>> resulting in the disk being full. For me it seems like the maxHistory
>> setting only affects how many files defined by the rolling pattern are
>> kept. If you have defined it like this:
>>
>> ${org.apache.nifi.bootstrap.config.log.dir}/nifi-app%d{yyyy-MM-dd}.%i.log
>> MaxHistory only affects the days, not the increment file %i per day. So
>> you can still have thousands of files in one day.
>> The totalSizeCap will delete the oldest files if the total size hits the
>> cap setting.
>>
>> The totalSizeCap has been added in the logback.xml file for
>> nifi-registry, where it has been added inside the rollingPolicy section. I
>> could not get it to work inside the rollingPolicy section in nifi but just
>> added it in the appender section. See my comment in the jira:
>> https://issues.apache.org/jira/browse/NIFI-2203
>>
>> Kind regards
>> Jens M. Kofoed
>>
>> Den lør. 8. jul. 2023 kl. 04.27 skrev Mike Thomsen <
>> mikerthom...@gmail.com>:
>>
>>> Yeah, I'm working through some of it where I have time. I plan to have a
>>> Jira up this weekend. I'm wondering, though, if we shouldn't consider a
>>> spike for switching to log4j2 in 2.X because I saw a lot of complaints
>>> about logback being inconsistent in honoring its settings.
>>>
>>> On Fri, Jul 7, 2023 at 10:19 PM Joe Witt  wrote:
>>>
>>>> H.  Interesting.  Can you capture these bits of fun in a jira?
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Jul 7, 2023 at 7:17 PM Mike Thomsen 
>>>> wrote:
>>>>
>>>>> After doing some research, it appears that <maxHistory> is a wonky
>>>>> setting WRT how well it's honored by logback. I let a GenerateFlowFile >
>>>>> LogAttribute flow run for a long time, and it just kept filling up. When I
>>>>> added <totalSizeCap> that appeared to force expected behavior on total log
>>>>> size. We might want to add the following:
>>>>>
>>>>> <cleanHistoryOnStart>true</cleanHistoryOnStart>
>>>>> <totalSizeCap>50GB</totalSizeCap>
>>>>>
>>>>> On Fri, Jul 7, 2023 at 11:33 AM Michael Moser 
>>>>> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> You aren't alone in experiencing this.  I think logback uses a
>>>>>> pattern matcher on filename to discover files to delete.  If "something"
>>>>>> happens which causes a gap in the date pattern, then the matcher will 
>>>>>> then
>>>>>> fail to pick up and delete files on the other side of that gap.
>>>>>>
>>>>>> Regards,
>>>>>> -- Mike M
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 6, 2023 at 10:28 AM Mike Thomsen 
>>>>>> wrote:
>>>>>>
>>>>>>> We are using the stock configuration, and have noticed that we have
>>>>>>> a lot of nifi-app* logs that are well beyond the historic data cap of 30
>>>>>>> days in logback.xml; some of those logs go back to April. We also have a
>>>>>>> bunch of 0 byte nifi-user logs and some of the other logs are 0 bytes as
>>>>>>> well. It looks like logback is rotating based on time, but isn't 
>>>>>>> cleaning
>>>>>>> up. Is this expected behavior or a problem with the configuration?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Mike
>>>>>>>
>>>>>>


Re: NiFi not rolling logs

2023-07-07 Thread Mike Thomsen
Yeah, I'm working through some of it where I have time. I plan to have a
Jira up this weekend. I'm wondering, though, if we shouldn't consider a
spike for switching to log4j2 in 2.X because I saw a lot of complaints
about logback being inconsistent in honoring its settings.

On Fri, Jul 7, 2023 at 10:19 PM Joe Witt  wrote:

> H.  Interesting.  Can you capture these bits of fun in a jira?
>
> Thanks
>
> On Fri, Jul 7, 2023 at 7:17 PM Mike Thomsen 
> wrote:
>
>> After doing some research, it appears that <maxHistory> is a wonky
>> setting WRT how well it's honored by logback. I let a GenerateFlowFile >
>> LogAttribute flow run for a long time, and it just kept filling up. When I
>> added <totalSizeCap> that appeared to force expected behavior on total log
>> size. We might want to add the following:
>>
>> <cleanHistoryOnStart>true</cleanHistoryOnStart>
>> <totalSizeCap>50GB</totalSizeCap>
>>
>> On Fri, Jul 7, 2023 at 11:33 AM Michael Moser  wrote:
>>
>>> Hi Mike,
>>>
>>> You aren't alone in experiencing this.  I think logback uses a pattern
>>> matcher on filename to discover files to delete.  If "something" happens
>>> which causes a gap in the date pattern, then the matcher will then fail to
>>> pick up and delete files on the other side of that gap.
>>>
>>> Regards,
>>> -- Mike M
>>>
>>>
>>> On Thu, Jul 6, 2023 at 10:28 AM Mike Thomsen 
>>> wrote:
>>>
>>>> We are using the stock configuration, and have noticed that we have a
>>>> lot of nifi-app* logs that are well beyond the historic data cap of 30 days
>>>> in logback.xml; some of those logs go back to April. We also have a bunch
>>>> of 0 byte nifi-user logs and some of the other logs are 0 bytes as well. It
>>>> looks like logback is rotating based on time, but isn't cleaning up. Is
>>>> this expected behavior or a problem with the configuration?
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>


NiFi not rolling logs

2023-07-06 Thread Mike Thomsen
We are using the stock configuration, and have noticed that we have a lot
of nifi-app* logs that are well beyond the historic data cap of 30 days in
logback.xml; some of those logs go back to April. We also have a bunch of 0
byte nifi-user logs and some of the other logs are 0 bytes as well. It
looks like logback is rotating based on time, but isn't cleaning up. Is
this expected behavior or a problem with the configuration?

Thanks,

Mike
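
For anyone hitting the same rollover problem, a minimal logback.xml sketch that
combines the settings discussed in the replies above (maxHistory, totalSizeCap,
cleanHistoryOnStart). The values and appender name are examples, and the thread
notes that where totalSizeCap is honored has varied, so verify against the
logback version your NiFi ships with:

    <appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <!-- days of rolled files to keep -->
            <maxHistory>30</maxHistory>
            <!-- hard cap across all archived files; oldest are deleted first -->
            <totalSizeCap>50GB</totalSizeCap>
            <!-- apply retention at startup instead of waiting for the next rollover -->
            <cleanHistoryOnStart>true</cleanHistoryOnStart>
        </rollingPolicy>
        <encoder>
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>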


Re: Out of file descriptors? but not!

2023-07-05 Thread Mike Thomsen
I think with systemd, services can have their own file handle limits. Check
the NiFi service configuration for that. IIRC, we had to set configuration
flags in the NiFi service configuration specifically to force systemd to give
that service unlimited handles.

On Mon, Jul 3, 2023 at 10:31 AM Greene (US), Geoffrey N <
geoffrey.n.gre...@boeing.com> wrote:

>
>
> So, I came back from two weeks vacation…
>
>
>
> My nifi (1.17.0) is misbehaving.
>
>
>
> The logs say
>
> 2023-07-03 07:04:07,668 ERROR [Index Provenance Events-1]
> o.a.n.p.index.lucene.EventIndexTask Failed to index Provenance Events
>
> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
>
> …
>
> Caused by: java.nio.file.FileSystemException:
> /data/nifi-1.17.0/provenance_repository/lucene-8-index-1687383779103/_2y6.kdi:
> Too many open files
>
> at
> java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:100)
>
> …
>
>
>
> I’m out of file descriptors?
>
>
>
> After determining that nifi is running as process id 19655, I see:
>
>
>
> lsof -p 19655 | wc -l
>
> 85
>
>
>
> 85?  That’s really low.
>
>
>
> /etc/sysctl.conf contains
>
> fs.file-max = 9815744
>
>
>
>
>
> $ ulimit -a
>
> address space limit (Kibytes)  (-M)  unlimited
>
> core file size (blocks)(-c)  0
>
> cpu time (seconds) (-t)  unlimited
>
> data size (Kibytes)(-d)  unlimited
>
> file size (blocks) (-f)  unlimited
>
> locks  (-x)  unlimited
>
> locked address space (Kibytes) (-l)  64
>
> message queue size (Kibytes)   (-q)  800
>
> nice   (-e)  0
>
> nofile (-n)  1048576
>
> nproc  (-u)  4096
>
> pipe buffer size (bytes)   (-p)  4096
>
> max memory size (Kibytes)  (-m)  unlimited
>
> rtprio (-r)  0
>
> socket buffer size (bytes) (-b)  4096
>
> sigpend(-i)  192389
>
> stack size (Kibytes)   (-s)  8192
>
> swap size (Kibytes)(-w)  not supported
>
> threads(-T)  not supported
>
> process size (Kibytes) (-v)  unlimited
>
>
>
> I DO notice one more odd thing:
>
>
>
> I have 0 flow files (though a number of flows running)
>
>
>
> Despite that, I see that the repositories are definitely taking up some
> space.
>
>
>
> 350Gcontent_repository
>
> 712Kdatabase_repository
>
> 28K flowfile_repository
>
> 10G provenance_repository
>
>
>
>
>
>
>
> Any thoughts? I was thinking about doing an rm -rf on
> provenance_repository... or maybe decreasing
>
> nifi.provenance.repository.max.storage.time
>
> or increasing
>
> nifi.provenance.repository.max.storage.size (currently set to 10 G)
>
> in nifi.properties.
>
>
>
>
>
>
>
> Geoffrey Greene
>
>
>
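
A sketch of the systemd side of the suggestion above, assuming NiFi runs under
a unit named nifi.service (the unit name and limit values are examples):

    # /etc/systemd/system/nifi.service.d/limits.conf
    [Service]
    LimitNOFILE=100000
    LimitNPROC=10000

    # reload, restart, then verify what the running JVM actually received
    sudo systemctl daemon-reload
    sudo systemctl restart nifi
    grep "open files" /proc/$(pgrep -f org.apache.nifi.NiFi | head -1)/limits

The ulimit output from an interactive shell does not necessarily reflect the
limits systemd applied to the service, which is why the /proc check matters.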


Kafka 3 w/ KRaft and ConsumeKafka

2023-06-15 Thread Mike Thomsen
Is ConsumeKafka 2.6 compatible with Kafka 3 running KRaft instead of
ZooKeeper?


Re: Need Help in migrating Giant CSV from S3 to SFTP

2023-05-12 Thread Mike Thomsen
1. No, it needs to be local storage.
2. That would be a question for AWS because you would need a S3 feature
that allows you to tell S3 to open a SFTP connection and manage the
transfer for you.

400GB is really not that big of a deal on a decent AWS setup, though. The
internal transfer rates between S3 and EC2 and EKS are very high.

On Fri, May 12, 2023 at 2:53 AM Kumar, Nilesh via users <
users@nifi.apache.org> wrote:

> Gentle People,
>
>
>
> I understand that we need to have enough space in the content repo to transfer
> a 400GB CSV. I have two questions:
>
>1. Can this space for the content repository be external S3 instead of an
>extra volume? I will try to update the property value below.
>2. Is there any way we can move the data directly without holding it in the
>content repo? Disk I/O of 400GB will be less performant.
>
> I am planning to create a Persistent Volume of 500GB, mount it to my nifi
> server and configure below properties to use that mount path.
>
> nifi.content.repository.directory.default=../content_repository
>
>
>
> *From:* Mike Thomsen 
> *Sent:* Thursday, May 11, 2023 11:17 PM
> *To:* users@nifi.apache.org
> *Subject:* [EXT] Re: Need Help in migrating Giant CSV from S3 to SFTP
>
>
>
> Nilesh,
>
>
>
> The issue is you're running out of space on the disk. Ask your devops team
> to provision a lot more space for the partition where the content
> repository resides. Adding an extra 500GB should give you more than enough
> space to cover it and a little buffer in case you want to do something else
> with it that mutates the data.
>
>
>
> On Tue, May 9, 2023 at 2:53 PM Joe Witt  wrote:
>
> Nilesh,
>
>
>
> These processors generally are not memory sensitive as they should
> only ever have small amounts in memory at a time so it is likely this
> should work well up to 100s of GB objects and so on.  We of course dont
> really test that much but it is technically reasonable and designed as
> such.  So what would be the bottleneck?  It is exactly what Eric is
> flagging.
>
>
>
> You will need a large content repository available large enough to hold as
> much data in flight as you'll have at any one time.  It looks like you have
> single files as large as 400GB with some being 100s or 10s of GB as well
> and I'm guessing many can happen at/around one time.  So you'll need a far
> larger content repository than you're currently using.  It shows that free
> space on any single node is on average 140GB which means you have very
> little head room for what you're trying to do.  You should try to have a TB
> or more available for this kind of case (per node).
>
>
>
> You mention it fails but please provide information showing how/the logs.
>
>
>
> Also please do not use load balancing on every connection.  You want to
> use that feature selectively/by design choices.  For now - I'd avoid it
> entirely or just use it between listing and fetching.  But certainly not
> after fetching given how massive the content is that would have to be
> shuffled around.
>
>
>
> Thanks
>
>
>
> On Tue, May 9, 2023 at 9:07 AM Kumar, Nilesh via users <
> users@nifi.apache.org> wrote:
>
> Hi Eric
>
>
>
> I see following for my content Repository. Can you please help me on how
> to tweak it further. I have deployed nifi on K8s with 3 replica pod
> cluster, with no limit on resources. But I guess the pod cpu/memory will be
> throttled by node capacity itself. I noticed that since I have one single
> file as 400GB, all the load goes to whichever node picks up the
> transfer. I wanted to know if we can do this any other way of configuring
> the flow. If not, please tell me the metrics for nifi to tweak.
>
>
>
> *From:* Eric Secules 
> *Sent:* Tuesday, May 9, 2023 9:26 PM
> *To:* users@nifi.apache.org; Kumar, Nilesh 
> *Subject:* [EXT] Re: Need Help in migrating Giant CSV from S3 to SFTP
>
>
>
> Hi Nilesh,
>
>
>
> Check the size of your content repository. If you want to transfer a 400GB
> file through nifi, your content repository must be greater than 400GB,
> someone else might have a better idea of how much bigger you need. But
> generally it all depends on how many of these big files you want to
> transfer at the same time. You can check the content repository metrics in
> the Node Status from the hamburger menu in the top right corner of the
> canvas.
>
>
>
> -Eric
>
>
>
> On Tue., May 9, 2023, 8:42 a.m. Kumar, Nilesh via users, <
> users@nifi.apache.org> wrote:
>
> Hi Team,
>
> I want to move a very large file like 400GB from S3 to SFTP. I have used
> listS3 -> FetchS3 -> putSFTP. This work

Re: Need Help in migrating Giant CSV from S3 to SFTP

2023-05-11 Thread Mike Thomsen
Nilesh,

The issue is you're running out of space on the disk. Ask your devops team
to provision a lot more space for the partition where the content
repository resides. Adding an extra 500GB should give you more than enough
space to cover it and a little buffer in case you want to do something else
with it that mutates the data.

On Tue, May 9, 2023 at 2:53 PM Joe Witt  wrote:

> Nilesh,
>
> These processors generally are not memory sensitive as they should
> only ever have small amounts in memory at a time so it is likely this
> should work well up to 100s of GB objects and so on.  We of course dont
> really test that much but it is technically reasonable and designed as
> such.  So what would be the bottleneck?  It is exactly what Eric is
> flagging.
>
> You will need a large content repository available large enough to hold as
> much data in flight as you'll have at any one time.  It looks like you have
> single files as large as 400GB with some being 100s or 10s of GB as well
> and I'm guessing many can happen at/around one time.  So you'll need a far
> larger content repository than you're currently using.  It shows that free
> space on any single node is on average 140GB which means you have very
> little head room for what you're trying to do.  You should try to have a TB
> or more available for this kind of case (per node).
>
> You mention it fails but please provide information showing how/the logs.
>
> Also please do not use load balancing on every connection.  You want to
> use that feature selectively/by design choices.  For now - I'd avoid it
> entirely or just use it between listing and fetching.  But certainly not
> after fetching given how massive the content is that would have to be
> shuffled around.
>
> Thanks
>
> On Tue, May 9, 2023 at 9:07 AM Kumar, Nilesh via users <
> users@nifi.apache.org> wrote:
>
>> Hi Eric
>>
>>
>>
>> I see following for my content Repository. Can you please help me on how
>> to tweak it further. I have deployed nifi on K8s with 3 replica pod
>> cluster, with no limit on resources. But I guess the pod cpu/memory will be
>> throttled by node capacity itself. I noticed that since I have one single
>> file as 400GB, all the load goes to whichever node picks up the
>> transfer. I wanted to know if we can do this any other way of configuring
>> the flow. If not, please tell me the metrics for nifi to tweak.
>>
>>
>>
>> *From:* Eric Secules 
>> *Sent:* Tuesday, May 9, 2023 9:26 PM
>> *To:* users@nifi.apache.org; Kumar, Nilesh 
>> *Subject:* [EXT] Re: Need Help in migrating Giant CSV from S3 to SFTP
>>
>>
>>
>> Hi Nilesh,
>>
>>
>>
>> Check the size of your content repository. If you want to transfer a
>> 400GB file through nifi, your content repository must be greater than
>> 400GB, someone else might have a better idea of how much bigger you need.
>> But generally it all depends on how many of these big files you want to
>> transfer at the same time. You can check the content repository metrics in
>> the Node Status from the hamburger menu in the top right corner of the
>> canvas.
>>
>>
>>
>> -Eric
>>
>>
>>
>> On Tue., May 9, 2023, 8:42 a.m. Kumar, Nilesh via users, <
>> users@nifi.apache.org> wrote:
>>
>> Hi Team,
>>
>> I want to move a very large file like 400GB from S3 to SFTP. I have used
>> listS3 -> FetchS3 -> putSFTP. This works for smaller files till 30GB but
>> fails for larger(100GB) files. Is there any way to configure this flow so
>> that it handles very large single file. If there is any template that
>> exists please share.
>>
>> My configuration are all standard processor configuration.
>>
>>
>>
>> Thanks,
>>
>> Nilesh
>>
>>
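
As a concrete sketch of the sizing advice above, the content repository can be
pointed at a dedicated large volume in nifi.properties (the mount path is an
example):

    # nifi.properties
    nifi.content.repository.directory.default=/mnt/nifi-content/content_repository
    # keep archived content from consuming the whole volume (50% is the default)
    nifi.content.repository.archive.max.usage.percentage=50%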


How to debug why a node isn't rejoining a cluster

2023-05-02 Thread Mike Thomsen
I have a node in my dev cluster that is disconnected and won't reconnect.
I'm tailing the nifi-app.log while trying to reconnect it, but nothing's
getting logged. The logback configuration is standard for NiFi AFAIK. Are
there any additional loggers that need to be enabled to figure out why it's
not reconnecting? It seems like it's just silently failing.

Thanks,

Mike
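
In case it helps anyone debugging the same symptom, a sketch of extra loggers
to try in conf/logback.xml; the logger names follow the NiFi cluster packages,
and DEBUG is noisy, so scope it to the investigation:

    <logger name="org.apache.nifi.cluster" level="DEBUG"/>
    <logger name="org.apache.nifi.controller.cluster" level="DEBUG"/>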


Re: [EXTERNAL] Re: json into a json-enabled DB

2022-12-19 Thread Mike Thomsen
UpdateRecord should be useful for that. If you use escapeJson, you can
create an escaped JSON string from the result of a record path
operation.

On Fri, Dec 16, 2022 at 11:22 AM Greene (US), Geoffrey N
 wrote:
>
> Yeah, I was able to get json into the db it by using strings.
>
> Unfortunately, I have some escape characters in my strings, and it looks like 
> I have to escape my escapes. Which ends up being either a few text processors 
> or a groovy script.
>
> To paraphrase the meme, "yo dawg I hear you like escapes with your escapes..."
>
> But you are correct, I was able to make it happen. I was just hoping for 
> something a little more record-oriented (or something).  I guess if it works, 
> don't complain...
>
> -Original Message-
> From: Mike Thomsen [mailto:mikerthom...@gmail.com]
> Sent: Friday, December 16, 2022 10:59 AM
> To: users@nifi.apache.org
> Subject: [EXTERNAL] Re: json into a json-enabled DB
>
>
> To Matt's point, I've tested insert by doing a record field of String going 
> to JSON/JSONB in Postgres and MySQL, and that worked just fine.
> I'm not sure if we're at a point where we can do a reader with one schema and 
> a writer with another schema, but it should be pretty straight forward to fix 
> so that worst case scenario that is ConvertRecord -> PutDatabaseRecord
>
> On Thu, Dec 15, 2022 at 10:21 PM Matt Burgess  wrote:
> >
> > Geoffrey,
> >
> > The biggest problem with JSON columns across the board is that the
> > JDBC and java.sql.Types specs don't handle them natively, and NiFi
> > records don't recognize JSON as a particular type, we are only
> > interested in the overall datatype such as String since NiFi records
> > can be in any supported format. In my experience these are handled by
> > setting the JSON column to type java.sql.OTHER (like PostgreSQL) and
> > they are willing to accept the value as a String (see NIFI-5901 [1]),
> > and we put in code to handle it as such (see NIFI-5845 [2]). For NiFi
> > it's been more of an ad-hoc type of support where maybe if the SQL
> > type is custom and unique we can handle such things (like sql_variant
> > in MSSQL via NIFI-5819 [3]), but due to the nature of the custom type
> > it's difficult to handle in any sort of consistent way. Happy to hear
> > your thoughts and input, perhaps we can add some ad-hoc support for
> > your use case?
> >
> > Regards,
> > Matt
> >
> > [1] https://issues.apache.org/jira/browse/NIFI-5901
> > [2] https://issues.apache.org/jira/browse/NIFI-5845
> > [3] https://issues.apache.org/jira/browse/NIFI-5819
> >
> > On Wed, Dec 14, 2022 at 3:55 PM Greene (US), Geoffrey N
> >  wrote:
> > >
> > > Some databases (postgres, sql server,  others) support native json 
> > > columns.
> > >
> > > With postgres, there’s a native jsonb type, with sql server it’s a string 
> > > type, that you can treat as json.
> > >
> > >
> > >
> > > In any event, once you have the json in the database, one can then query 
> > > it, e.g.:
> > >
> > >
> > >
> > > SELECT id,product_name,
> > >
> > >JSON_VALUE(attributes, '$.material') AS material
> > >
> > > FROM jsontest;
> > >
> > >
> > >
> > > So, here’s my question:
> > >
> > >
> > >
> > > If you have a flow file that contains json, what's the best way to insert 
> > > that into a database?
> > >
> > > The only thing I’ve thought of so far is if you have the json string
> > >
> > > {“material” : “plastic”}
> > >
> > > You then use a TEXT processor to turn that into
> > >
> > > {“attributes”: {‘{“material” : “plastic”}’}
> > >
> > > And then use a PutDatabaseRecord to actually write the entry.
> > >
> > >
> > >
> > > Is there a better, or more efficient way to do it?
> > >
> > >
> > >
> > >
> > >
> > >
>
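
To make the record-path suggestion above concrete: UpdateRecord with
Replacement Value Strategy set to Record Path Value can populate a string
field from escapeJson(/rawJson) (field names here are hypothetical), and
PutDatabaseRecord then writes that string into a jsonb column. A Postgres
sketch with example table and column names:

    CREATE TABLE jsontest (
        id           bigserial PRIMARY KEY,
        product_name text,
        attributes   jsonb
    );

    -- querying the stored document afterwards
    SELECT id, product_name, attributes->>'material' AS material
    FROM   jsontest;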


Re: json into a json-enabled DB

2022-12-16 Thread Mike Thomsen
To Matt's point, I've tested insert by doing a record field of String
going to JSON/JSONB in Postgres and MySQL, and that worked just fine.
I'm not sure if we're at a point where we can do a reader with one
schema and a writer with another schema, but it should be pretty
straight forward to fix so that worst case scenario that is
ConvertRecord -> PutDatabaseRecord

On Thu, Dec 15, 2022 at 10:21 PM Matt Burgess  wrote:
>
> Geoffrey,
>
> The biggest problem with JSON columns across the board is that the
> JDBC and java.sql.Types specs don't handle them natively, and NiFi
> records don't recognize JSON as a particular type, we are only
> interested in the overall datatype such as String since NiFi records
> can be in any supported format. In my experience these are handled by
> setting the JSON column to type java.sql.OTHER (like PostgreSQL) and
> they are willing to accept the value as a String (see NIFI-5901 [1]),
> and we put in code to handle it as such (see NIFI-5845 [2]). For NiFi
> it's been more of an ad-hoc type of support where maybe if the SQL
> type is custom and unique we can handle such things (like sql_variant
> in MSSQL via NIFI-5819 [3]), but due to the nature of the custom type
> it's difficult to handle in any sort of consistent way. Happy to hear
> your thoughts and input, perhaps we can add some ad-hoc support for
> your use case?
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/NIFI-5901
> [2] https://issues.apache.org/jira/browse/NIFI-5845
> [3] https://issues.apache.org/jira/browse/NIFI-5819
>
> On Wed, Dec 14, 2022 at 3:55 PM Greene (US), Geoffrey N
>  wrote:
> >
> > Some databases (postgres, sql server,  others) support native json columns.
> >
> > With postgres, there’s a native jsonb type, with sql server it’s a string 
> > type, that you can treat as json.
> >
> >
> >
> > In any event, once you have the json in the database, one can then query 
> > it, e.g.:
> >
> >
> >
> > SELECT id,product_name,
> >
> >JSON_VALUE(attributes, '$.material') AS material
> >
> > FROM jsontest;
> >
> >
> >
> > So, here’s my question:
> >
> >
> >
> > If you have a flow file that contains json, what's the best way to insert 
> > that into a database?
> >
> > The only thing I’ve thought of so far is if you have the json string
> >
> > {“material” : “plastic”}
> >
> > You then use a TEXT processor to turn that into
> >
> > {“attributes”: {‘{“material” : “plastic”}’}
> >
> > And then use a PutDatabaseRecord to actually write the entry.
> >
> >
> >
> > Is there a better, or more efficient way to do it?
> >
> >
> >
> >
> >
> >


Re: Customizing NiFi in a Docker Container on EC2

2022-11-11 Thread Mike Thomsen
If you're just slapping Docker on the box and not trying something
fancy like running it on EKS, I would just switch to running NiFi
directly. That allows you to use something like Chef or Ansible to
manage your configurations which is a lot more powerful than manually
tweaking Docker.

On Fri, Nov 11, 2022 at 6:20 AM James McMahon  wrote:
>
> The NiFi System Administration Guide makes many recommendations for 
> configuration changes to optimize nifi performance. These "best practices", 
> for example:
> * 
> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#configuration-best-practices
> Placement of repos on separate disk devices is another big one; here is an 
> example:
> * 
> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#content-repository
>
> I have nifi installed in a docker container on an EC2 instance.
>
> As far as I can tell, it is not optimized and I am writing today to ask if 
> anyone has done that with success? If so, I'd like to learn more about that.
>
>
> I use the command
> docker exec -it nifi /bin/bash
> to review what I understand to be the nifi directories and config files in
> the container.
>
> I notice nifi in the container is in many ways not configured to Apache
> NiFi recommendations for optimal performance. For example, the
> nifi.properties repo params for the docker installation all show repos placed 
> on one common disk
> device (the one where the container lives, presumably).
>
> I've configured external ebs volumes that I've mounted on my instance. My 
> intention: one
> for content_repository, one for flowfile_repository, and likewise for
> database and provenance repositories. I'd like to have the containerized
> nifi write to and read from those so that I don't bottleneck performance
> reading and writing to the same device for repos.
>
> I need to persist my changes to nifi config files. How does one avoid
> making changes in nifi.properties and the like that are lost when the
> docker container is stopped, deleted, and a new one instantiated?
>
> I need to engage with external repo resources when nifi runs within my 
> container.
> How do we direct nifi in the container to use those external resources
> outside of the container to host content_repository, etc etc?
>
> Thank you in advance for any help.
> Jim
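
If staying on Docker, a sketch of bind-mounting the conf directory and the
repositories onto the EBS volumes so changes survive container recreation.
Host paths and the image tag are examples; the container paths follow the
apache/nifi image's default NIFI_HOME, and a mounted conf directory should be
seeded with the files from the image first:

    docker run -d --name nifi \
      -p 8443:8443 \
      -v /mnt/ebs-conf/conf:/opt/nifi/nifi-current/conf \
      -v /mnt/ebs-content/content_repository:/opt/nifi/nifi-current/content_repository \
      -v /mnt/ebs-flowfile/flowfile_repository:/opt/nifi/nifi-current/flowfile_repository \
      -v /mnt/ebs-provenance/provenance_repository:/opt/nifi/nifi-current/provenance_repository \
      -v /mnt/ebs-database/database_repository:/opt/nifi/nifi-current/database_repository \
      apache/nifi:1.19.1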


Re: NiFi on AWS EC2

2022-11-08 Thread Mike Thomsen
> won't render if you're using Edge or IE to access.

The old, discontinued Edge had that problem, but Edge has worked just
fine for the last 2 years or so since it was redone on top of
Chromium.

IMO if you're going to use Chromium with NiFi, the best experience
seems to be Brave.

On Tue, Nov 8, 2022 at 4:21 PM Patrick Timmins  wrote:
>
> In addition to the other suggestions, the last time I checked, the HTML5
> of the NiFi interface won't render if you're using Edge or IE to
> access.  Brave, Chrome, Firefox etc will work, however.
>
> On 11/8/2022 1:53 PM, James McMahon wrote:
> > Has anyone successfully configured NiFi on AWS, and accessed it from a
> > browser on a Windows desktop? I’ve tried following a few links to do
> > this. I’ve verified that my instance security group allows access to
> > 8080 via its inbound rules. I’ve putty’ed into the instance via ssh
> > port 22 to verify that there are no firewall restrictions. But still I
> > get a message to the effect that the server rejected the connection
> > request. Can anyone recommend a link that describes a success path for
> > this?
> > Thanks in advance for your help.
> > Jim


Re: DistributedMapCacheServer

2022-11-04 Thread Mike Thomsen
Perhaps I'm mistaken, but ZK is designed for managing configuration
data and not the sort of large scale key/value lookup that is implied
with DistributedMapCache implementations.

On Fri, Nov 4, 2022 at 6:28 AM ta.fiat.belastingdienst.nl via users
 wrote:
>
>
> Hello,
>
> I'm investigating Redis, seems to work easy.
>
> I think it is a bit strange that Zookeeper is not in the list of providers. 
> Zookeeper is already used as the cluster manager for NiFi, so that would be easy 
> to add, in my opinion.
>
> regards,
>
> Tiemen.
>
>
>
> - Original message -
> From: "Mike Thomsen" 
> To: users@nifi.apache.org
> Cc:
> Subject: Re: DistributedMapCacheServer
> Date: Thu, 3 Nov 2022 15:58
>
>
> You can also use the Cassandra DMC, which is something we are starting
> to use a lot where I work.
>
> Admittedly, the documentation is non-existent at the moment, but we
> also open sourced an experimental delegating DMC client that can be
> used to chain multiple DMCs together so you can do Redis for hot
> caching and something like Cassandra for broader cold caches
>
> https://github.com/Domestic-Resilience-FOSS/nifi-delegating-distributedmapcache-bundle
>
> On Wed, Nov 2, 2022 at 5:25 PM Peter Turcsanyi  wrote:
> >
> > Embedded Hazelcast can also be an option. In that case, there is no
> > need to set up an external cache but the Hazelcast instances are
> > running on the NiFi nodes (in the same JVM as NiFi).
> > Please note: no security/authentication is supported in embedded mode.
> >
> > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hazelcast-services-nar/1.18.0/org.apache.nifi.hazelcast.services.cacheclient.HazelcastMapCacheClient/index.html
> > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hazelcast-services-nar/1.18.0/org.apache.nifi.hazelcast.services.cachemanager.EmbeddedHazelcastCacheManager/index.html
> >
> > On Wed, Nov 2, 2022 at 5:10 PM Bryan Bende  wrote:
> > >
> > > The DMC Server does not support high availability. If you have a 3
> > > node nifi cluster, each node will have a DMC client and DMC server,
> > > but the clients all have to point at only one of the servers, and if
> > > the node where that server is running goes down, there is no
> > > replication or failover to another node. So it is really up to you to
> > > decide if that is acceptable for your use case. If its not then you
> > > need to use a different DMC client implementation that can communicate
> > > with an external HA cache, like Redis.
> > >
> > > On Wed, Nov 2, 2022 at 11:38 AM Greene (US), Geoffrey N
> > >  wrote:
> > > >
> > > > I make heavy use of DistributedMapCacheServer in my nifi flows (one 
> > > > node; not clustered).
> > > >
> > > >
> > > >
> > > > I seem to remember reading that the DistributedMapCacheServer is not to 
> > > > be used in production; it’s a reference implementation only, and it is 
> > > > not really recommended for production.
> > > >
> > > >
> > > >
> > > > Unfortunately, I can no longer find the reference saying that 
> > > > DistributedMapCacheServer is not trustworthy for prod.
> > > >
> > > > I don’t have an HDFS implementation anywhere, but I do need the 
> > > > cacheing part.
> > > >
> > > >
> > > >
> > > > Can someone explain?  Can I use DistributedMapCacheServer in my 
> > > > production flows?
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
>
>
>
>


Re: DistributedMapCacheServer

2022-11-03 Thread Mike Thomsen
You can also use the Cassandra DMC, which is something we are starting
to use a lot where I work.

Admittedly, the documentation is non-existent at the moment, but we
also open sourced an experimental delegating DMC client that can be
used to chain multiple DMCs together so you can do Redis for hot
caching and something like Cassandra for broader cold caches

https://github.com/Domestic-Resilience-FOSS/nifi-delegating-distributedmapcache-bundle

On Wed, Nov 2, 2022 at 5:25 PM Peter Turcsanyi  wrote:
>
> Embedded Hazelcast can also be an option. In that case, there is no
> need to set up an external cache but the Hazelcast instances are
> running on the NiFi nodes (in the same JVM as NiFi).
> Please note: no security/authentication is supported in embedded mode.
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hazelcast-services-nar/1.18.0/org.apache.nifi.hazelcast.services.cacheclient.HazelcastMapCacheClient/index.html
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hazelcast-services-nar/1.18.0/org.apache.nifi.hazelcast.services.cachemanager.EmbeddedHazelcastCacheManager/index.html
>
> On Wed, Nov 2, 2022 at 5:10 PM Bryan Bende  wrote:
> >
> > The DMC Server does not support high availability. If you have a 3
> > node nifi cluster, each node will have a DMC client and DMC server,
> > but the clients all have to point at only one of the servers, and if
> > the node where that server is running goes down, there is no
> > replication or failover to another node. So it is really up to you to
> > decide if that is acceptable for your use case. If its not then you
> > need to use a different DMC client implementation that can communicate
> > with an external HA cache, like Redis.
> >
> > On Wed, Nov 2, 2022 at 11:38 AM Greene (US), Geoffrey N
> >  wrote:
> > >
> > > I make heavy use of DistributedMapCacheServer in my nifi flows (one node; 
> > > not clustered).
> > >
> > >
> > >
> > > I seem to remember reading that the DistributedMapCacheServer is not to 
> > > be used in production; it’s a reference implementation only, and it is 
> > > not really recommended for production.
> > >
> > >
> > >
> > > Unfortunately, I can no longer find the reference saying that 
> > > DistributedMapCacheServer is not trustworthy for prod.
> > >
> > > I don’t have an HDFS implementation anywhere, but I do need the cacheing 
> > > part.
> > >
> > >
> > >
> > > Can someone explain?  Can I use DistributedMapCacheServer in my 
> > > production flows?
> > >
> > >
> > >
> > >
> > >
> > >


Re: nifi-api with a server secured with Microsoft AD

2022-10-29 Thread Mike Thomsen
David,

Another option you might want to explore is having AD generate client
certificates for your users.

On Sat, Oct 29, 2022 at 12:01 PM Shawn Weeks  wrote:
>
> NiFi should always accept a cert at the rest api if you provide one. If your 
> using curl just add the “--key” and “--cert” and call whatever api url your 
> trying directly. You’ll need to make sure that the cert your using is signed 
> by the same local CA that NiFi is set to trust and that you’ve added a user 
> in NiFi that matches the common name on the cert or whatever regex you set 
> for “nifi.security.identity.mapping.value.pattern”
>
> Thanks
> Shawn
>
> > On Oct 28, 2022, at 3:55 PM, David Early via users  
> > wrote:
> >
> > Hi all,
> >
> > We have a 3 node cluster secured with Microsort AD for the first time.
> >
> > I need access to the REST api.  The nifi-api/access/token does not work in 
> > this case.
> >
> > We did use a local CA for certificate generation on the servers.
> >
> > I am reading that it is possible to do certificate based auth to the 
> > api... we need this in a script (python) to run on a remote server which is 
> > checking for old flowfiles that can get stuck in a few places.
> >
> > Can I use cert based API connection when using AD as the main 
> > authentication/authorization for the ui?
> >
> > Anything special that needs to be done?  I've just not used certs with the 
> > api before, but we have used cert based site to site on other systems and 
> > it works fine.  Just not sure how to do it with nipyapi or just from curl 
> > on the cli.
> >
> > David
>
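
A sketch of the curl invocation Shawn describes, with example paths, host, and
a read-only endpoint; the certificate's DN (after identity mapping) must match
a NiFi user that has been granted the relevant policies:

    curl --cert /opt/scripts/certs/flow-monitor.pem \
         --key /opt/scripts/certs/flow-monitor.key \
         --cacert /opt/scripts/certs/local-ca.pem \
         https://nifi-host.example.com:8443/nifi-api/flow/cluster/summary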


Re: Can ExecuteStreamCommand do this?

2022-09-30 Thread Mike Thomsen
The reason I chose gzip over others is because it's very fast at
compression and decompression while also being pretty solid at the
file sizes it generates. Definitely not as good as bzip2 or 7zip, but
as I said, S3 is very cheap. We'd rather spend a little more on S3
than spend more on extra CPU and memory for our NiFi nodes; with
decent IOPS configuration on the EC2 servers' disks, NiFi flies
through our gzipped data faster than it would something much heavier
like 7zip.

On Fri, Sep 30, 2022 at 9:01 AM James McMahon  wrote:
>
> Mike, let me make sure I understand this. Gzip outputs gz files that have 
> some reasonable level of compression. Because NiFi natively handles gzip 
> compressed files - presumably .gz extensions and some associated mime.type - 
> that is good enough for your purposes. You avoid 7za compression because NiFi 
> doesn't handle such compressed files natively, and because the gain in 
> compression is of little utility when S3 storage comes so cheaply; gzip 
> results are good enough.
> Is that the gist of it?
>
> On Fri, Sep 30, 2022 at 8:27 AM Mike Thomsen  wrote:
>>
>> I don't know what your use case is, but we avoid anything beyond gzip
>> because S3 is so cheap.
>>
>> On Thu, Sep 29, 2022 at 10:51 AM James McMahon  wrote:
>> >
>> > Thank you Mark. Had no idea there was this file-based dependency to 7z 
>> > files. Since my workaround appears to be working I think I may just move 
>> > forward with that.
>> > Steve, Mark - thank you again for replying.
>> > Jim
>> >
>> > On Thu, Sep 29, 2022 at 9:15 AM Mark Payne  wrote:
>> >>
>> >> It’s been a while. But if I remember correctly, the reason that NiFi does 
>> >> not natively support 7-zip format is that with 7-zip, the dictionary is 
>> >> written at the end of the file.
>> >> So when data is compressed, the dictionary is built up during compression 
>> >> and written at the end. This makes sense from a compression standpoint.
>> >> However, what it means, is that in order to decompress it, you must first 
>> >> jump to the end of the file in order to access the dictionary. Then jump 
>> >> back to the beginning of the file in order to perform the decompression.
>> >> NiFi makes use of Input Streams and Output Streams for FlowFIle access - 
>> >> it doesn’t provide a File-based approach. And this ability to jump to the 
>> >> end, read the dictionary, and then jump back to the beginning isn’t 
>> >> really possible with Input/Output Streams - at least, not without 
>> >> buffering everything into memory.
>> >>
>> >> So it would make sense that there would be a “Not Implemented” error when 
>> >> attempting to do the same thing using the 7-zip application directly, 
>> >> when attempting to use input streams & output streams.
>> >> I think that if you’re stuck with 7-zip, your own option will be to do 
>> >> what you’re doing - write the data out as a file, run the 7-zip 
>> >> application against that file, writing the output to some directory, and 
>> >> then picking up the files from that directory.
>> >> The alternative, of course, would be to update the source so that it’s 
>> >> creating zip files instead of 7-zip files, if you have sway over the 
>> >> source producer.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >>
>> >> On Sep 29, 2022, at 8:58 AM, stephen.hindmarch.bt.com via users 
>> >>  wrote:
>> >>
>> >> James,
>> >>
>> >> E_NOTIMPL means that feature is not implemented. I can see there is 
>> >> discussion about this down at sourceforge but the detail is blocked by my 
>> >> employer’s firewall.
>> >>
>> >> p7zip / Discussion / Help: E_NOTIMPL for stdin / stdout pipe
>> >>
>> >> https://sourceforge.net/p/p7zip/discussion/383044/thread/8066736d
>> >>
>> >> Steve Hindmarch
>> >>
>> >> From: James McMahon 
>> >> Sent: 29 September 2022 12:12
>> >> To: Hindmarch,SJ,Stephen,VIR R 
>> >> Cc: users@nifi.apache.org
>> >> Subject: Re: Can ExecuteStreamCommand do this?
>> >>
>> >> I ran with these Command Arguments in the ExecuteStreamCommand 
>> >> configuration:
>> >> x;-si;-so;-spf;-aou
>> >> ${filename} removed, -si indicating use of STDIN, -so STDOUT.
>> >>
>> >> The same error is thrown by 7z through ExecuteStreamCommand.

Re: Can ExecuteStreamCommand do this?

2022-09-30 Thread Mike Thomsen
I don't know what your use case is, but we avoid anything beyond gzip
because S3 is so cheap.

On Thu, Sep 29, 2022 at 10:51 AM James McMahon  wrote:
>
> Thank you Mark. Had no idea there was this file-based dependency to 7z files. 
> Since my workaround appears to be working I think I may just move forward 
> with that.
> Steve, Mark - thank you again for replying.
> Jim
>
> On Thu, Sep 29, 2022 at 9:15 AM Mark Payne  wrote:
>>
>> It’s been a while. But if I remember correctly, the reason that NiFi does 
>> not natively support 7-zip format is that with 7-zip, the dictionary is 
>> written at the end of the file.
>> So when data is compressed, the dictionary is built up during compression 
>> and written at the end. This makes sense from a compression standpoint.
>> However, what it means, is that in order to decompress it, you must first 
>> jump to the end of the file in order to access the dictionary. Then jump 
>> back to the beginning of the file in order to perform the decompression.
>> NiFi makes use of Input Streams and Output Streams for FlowFIle access - it 
>> doesn’t provide a File-based approach. And this ability to jump to the end, 
>> read the dictionary, and then jump back to the beginning isn’t really 
>> possible with Input/Output Streams - at least, not without buffering 
>> everything into memory.
>>
>> So it would make sense that there would be a “Not Implemented” error when 
>> attempting to do the same thing using the 7-zip application directly, when 
>> attempting to use input streams & output streams.
>> I think that if you’re stuck with 7-zip, your own option will be to do what 
>> you’re doing - write the data out as a file, run the 7-zip application 
>> against that file, writing the output to some directory, and then picking up 
>> the files from that directory.
>> The alternative, of course, would be to update the source so that it’s 
>> creating zip files instead of 7-zip files, if you have sway over the source 
>> producer.
>>
>> Thanks
>> -Mark
>>
>>
>> On Sep 29, 2022, at 8:58 AM, stephen.hindmarch.bt.com via users 
>>  wrote:
>>
>> James,
>>
>> E_NOTIMPL means that feature is not implemented. I can see there is 
>> discussion about this down at sourceforge but the detail is blocked by my 
>> employer’s firewall.
>>
>> p7zip / Discussion / Help: E_NOTIMPL for stdin / stdout pipe
>>
>> https://sourceforge.net/p/p7zip/discussion/383044/thread/8066736d
>>
>> Steve Hindmarch
>>
>> From: James McMahon 
>> Sent: 29 September 2022 12:12
>> To: Hindmarch,SJ,Stephen,VIR R 
>> Cc: users@nifi.apache.org
>> Subject: Re: Can ExecuteStreamCommand do this?
>>
>> I ran with these Command Arguments in the ExecuteStreamCommand configuration:
>> x;-si;-so;-spf;-aou
>> ${filename} removed, -si indicating use of STDIN, -so STDOUT.
>>
>> The same error is thrown by 7z through ExecuteStreamCommand: Executable 
>> command /bin/7za ended in an error: ERROR: Can not open the file as an 
>> archive  E_NOTIMPL
>>
>> I tried this at the command line, getting the same failure:
>> cat testArchive.7z | 7za x -si -so | dd of=stooges.txt
>>
>>
>> On Thu, Sep 29, 2022 at 6:44 AM James McMahon  wrote:
>>
>> Good morning, Steve. Indeed, that second paragraph is exactly how I did get 
>> this to work. I unpack to disk and then read in the twelve results using a 
>> GetFile. So far it is working well. It just feels a little wrong to me to do 
>> this, as I have introduced an extra write to and read from disk, which is 
>> going to be slower than doing it all in memory within the JVM. While that 
>> may not seem like anything significant for a single 7z file, as we work 
>> across thousands and thousands it can be significant.
>>
>> I am about to try what you suggested above: dropping the ${filename} 
>> entirely from the STDIN / STDOUT configuration. I realize it is not likely 
>> going to give me the twelve output flowfiles I'm seeking in the "output 
>> stream" path from ExecuteStreamCommand. I just want to see if it works 
>> without throwing that error.
>>
>> Welcome any other thoughts or comments you may have. Thanks again for your 
>> comments so far.
>>
>> Jim
>>
>> On Thu, Sep 29, 2022 at 5:23 AM  wrote:
>>
>> James,
>>
>> I have been thinking more about your problem and this may be the wrong 
>> approach. If you successfully unpack your files into the flow file content, 
>> you will still have one output flow file containing the unpacked contents of 
>> all of your files. If you need 12 separate files in their own flowfiles then 
>> you will need to find some way of splitting them up. Is there a byte 
>> sequence you can use in a SplitContent process, or a specific file length 
>> you can use in SplitText?
>>
>> Otherwise you may be better off using ExecuteStreamCommand to unpack the 
>> files on disk. Run it verbosely and use the output of that step to create a 
>> list of the locations where your recently unpacked files are. Or create a 
>> temporary directory to unpack in and fetch all the files 

Re: ExecuteStreamCommand fails to extract archived files

2022-09-21 Thread Mike Thomsen
> ExecuteStreamCommand works on the contents of the incoming flowfile, is that 
> understanding correct?

7za can't read the file from stdin. That's the problem AFAICT in your scenario.

On Wed, Sep 21, 2022 at 11:26 AM James McMahon  wrote:
>
> Thank you Mike. May I ask a few follow-up Qs after trying this and failing 
> still?
>
> ExecuteStreamCommand works on the contents of the incoming flowfile, is that 
> understanding correct? If so, then why does it matter where the file sits on 
> the filesystem if it will apply /bin/7za to the flowfile in the stream?
>
> So I have /bin/7za in the /bin directory, it's an executable program, and the 
> user that the NiFi JVM is running as - a user named nifi - has /bin in its path.
>
> I have an archive file I created in directory /mnt/in, and it is named 
> testArchive.7z. I am successfully able to read that archive file in with a 
> ListFile / FetchFile, and do get it in my stream. These are its attributes:
> absolute.path   /mnt/in/
> filename   testArchive.7z
>
> Is this java io exception telling us that it can't find the /bin/7za program, 
> or it can't find the data itself? And if ExecuteStreamCommand is supposed to 
> be applying that command to the flowfile in the stream, why is it important 
> that the archive file exists on disk where ExecuteStreamCommand can find it?
>
> On Wed, Sep 21, 2022 at 11:07 AM Mike Thomsen  wrote:
>>
>> To do this, you need to do UpdateAttribute (to set the temp folder
>> location) -> PutFile -> ExecuteStreamCommand to ensure the flowfile's
>> contents are staged where 7za can find them.
>>
>> I think the appropriate parameter would be something like this:
>>
>> Command Arguments: e;${path}/${filename}
>>
>> Assuming ";" is the argument delimiter.
>>
>> On Wed, Sep 21, 2022 at 10:45 AM James McMahon  wrote:
>> >
>> > Hello. I have a program /bin/7za that I need to apply to flowfiles  that 
>> > were created by 7za. One of them is testArchive.7z.
>> >
> >> > I try to employ an ExecuteStreamCommand to extract an incoming 
> >> > flowfile into N output flowfiles in the output stream, each representing 
> >> > one file from the contents of the flowfile.
>> >
>> > ESC throws error=2, No such file or directory.
>> >
>> > java.io.Exception: Cannot run program "/bin/7za"": error=2, No such file 
>> > or directory
>> >
>> > My ExecuteStreamCommand processor has this configuration:
> >> > Command Arguments   e
> >> > Command Path   /bin/7za
> >> > Ignore STDIN   false
> >> > Working Directory   no value set
> >> > Argument Delimiter   ;
>> > (I do not set an Output Destination Delimiter, intending to send the 
>> > output to output path "output stream" as separate flowfiles)
>> >
>> > How can I fix this problem?
>> >
>> > Thanks in advance,
>> > Jim


Re: ExecuteStreamCommand fails to extract archived files

2022-09-21 Thread Mike Thomsen
To do this, you need to do UpdateAttribute (to set the temp folder
location) -> PutFile -> ExecuteStreamCommand to ensure the flowfile's
contents are staged where 7za can find them.

I think the appropriate parameter would be something like this:

Command Arguments: e;${path}/${filename}

Assuming ";" is the argument delimiter.

On Wed, Sep 21, 2022 at 10:45 AM James McMahon  wrote:
>
> Hello. I have a program /bin/7za that I need to apply to flowfiles  that were 
> created by 7za. One of them is testArchive.7z.
>
> I try to employ an ExecuteStreamCommand to extract an incoming flowfile 
> into N output flowfiles in the output stream, each representing one file from 
> the contents of the flowfile.
>
> ESC throws error=2, No such file or directory.
>
> java.io.Exception: Cannot run program "/bin/7za"": error=2, No such file or 
> directory
>
> My ExecuteStreamCommand processor has this configuration:
> Command Arguments   e
> Command Path   /bin/7za
> Ignore STDIN   false
> Working Directory   no value set
> Argument Delimiter   ;
> (I do not set an Output Destination Delimiter, intending to send the output 
> to output path "output stream" as separate flowfiles)
>
> How can I fix this problem?
>
> Thanks in advance,
> Jim


Re: StandardOauth2AccessTokenProvider gets "token not active"

2022-09-06 Thread Mike Thomsen
Are you by any chance running Keycloak?

On Mon, Aug 29, 2022 at 4:03 AM Jens M. Kofoed
 wrote:
>
> Hi community
>
> I'm using the StandardOauth2AccessTokenProvider to get and refresh a token, 
> which works great. But almost at every refresh, one of the nodes in the 
> cluster gets this error. It's not the same node which gets the error every 
> time, all nodes gets it but only one node at a time.
>
> 2022-08-29 06:14:28,081 ERROR [Timer-Driven Process Thread-4] 
> org.apache.nifi.oauth2.StandardOauth2AccessTokenProvider 
> StandardOauth2AccessTokenProvider[id=861dbfea-0181-1000--d19b4cf0] 
> OAuth2 access token request failed [HTTP 400], response:
> {"error":"invalid_grant","error_description":"Token is not active"}
> 2022-08-29 06:14:28,082 INFO [Timer-Driven Process Thread-4] 
> org.apache.nifi.oauth2.StandardOauth2AccessTokenProvider 
> StandardOauth2AccessTokenProvider[id=861dbfea-0181-1000--d19b4cf0] 
> Refresh Access Token request failed 
> [https://foo.bar/auth/realms/myrealm/protocol/openid-connect/token]
> org.apache.nifi.processor.exception.ProcessException: OAuth2 access token 
> request failed [HTTP 400]
> at 
> org.apache.nifi.oauth2.StandardOauth2AccessTokenProvider.getAccessDetails(StandardOauth2AccessTokenProvider.java:327)
> at 
> org.apache.nifi.oauth2.StandardOauth2AccessTokenProvider.refreshAccessDetails(StandardOauth2AccessTokenProvider.java:315)
> at 
> org.apache.nifi.oauth2.StandardOauth2AccessTokenProvider.getAccessDetails(StandardOauth2AccessTokenProvider.java:249)
> at sun.reflect.GeneratedMethodAccessor408.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:254)
> at 
> org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:105)
> at com.sun.proxy.$Proxy183.getAccessDetails(Unknown Source)
> at 
> org.apache.nifi.processors.standard.InvokeHTTP.lambda$configureRequest$3(InvokeHTTP.java:1108)
> at java.util.Optional.ifPresent(Optional.java:159)
> at 
> org.apache.nifi.processors.standard.InvokeHTTP.configureRequest(InvokeHTTP.java:1107)
> at 
> org.apache.nifi.processors.standard.InvokeHTTP.onTrigger(InvokeHTTP.java:927)
> at 
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> at 
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1283)
> at 
> org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:214)
> at 
> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:103)
> at org.apache.nifi.engine.FlowEngine$2.run(FlowEngine.java:110)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750)
>
> I can't find any information in the log about when the processor successfully 
> refreshes the token. So I can't see if all nodes in the cluster are refreshing 
> the token at the same time, or if it's only the primary node which refreshes. If 
> all nodes are refreshing, could it be that one node is slower than the others to 
> refresh, and that the old token becomes invalid after the first node has 
> refreshed it?
>
> Kind regards
> Jens M. Kofoed


Re: AvroRuntimeExceptions after upgrading to 1.17.0

2022-08-29 Thread Mike Thomsen
> ConvertAvroToJSON

Try replacing that with ConvertRecord using an Avro reader and a JSON
writer that inherits the Avro reader's schema.

On Mon, Aug 22, 2022 at 4:33 AM Weiss, Christian  wrote:
>
> Hi guys,
>
> we upgraded our NiFi dev/test instance around two weeks ago from 1.16.3 to 
> 1.17.0.
> We tested around with our existing flows and also developed some new flows in 
> 1.17.0 - no problems so far.
>
> At the end of last week we upgraded our production instance and ran into some 
> problems with the combination of ExecuteSQL -> ConvertAvroToJSON processors 
> in our existing flows. The ExecuteSQL processors in those flows are 
> configured using "Compression Format: SNAPPY". All following 
> ConvertAvroToJSON processors of our old flows (developed with 1.16.3) are 
> raising: "AvroRuntimeException: Unrecognized codec: snappy". All flows which 
> were created using 1.17.0 don't have such problems.
>
> If we turn off compression in ExecuteSQL, everything works fine.
>
> Anyone with an idea what's going on here?
>
> Thanks,
> Christian
>
>
>


Re: Execute script code Jython2.7, code update not reflecting on server

2022-07-09 Thread Mike Thomsen
> However the same version of jython on windows

Are you sure you're running **J**ython on Windows and not Python? The
behavior you're describing sounds like the differences between Jython
and CPython 3.
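
If the Jython-vs-CPython string difference is indeed the culprit, one workaround
is to force the payload back to a byte string before handing it to quopri. A
minimal sketch (untested, and assuming msg is the parsed message from the script
quoted below):

import quopri

payload = msg.get_payload()                 # may come back as unicode under Jython 2.7
if isinstance(payload, unicode):            # 'unicode' only exists in Python 2 / Jython
    payload = payload.encode('ascii', 'ignore')   # quoted-printable content is ASCII
body = quopri.decodestring(payload).decode('utf-8', 'replace')
print(repr(body))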

On Fri, Jul 8, 2022 at 6:17 AM Sushant Sawant
 wrote:
>
> Hi all,
>
>  After further debugging, here are the observations:
>
>
>
> No matter what encoding I do, reading the string gives a unicode string in 
> Jython under NiFi.
> I'm unable to parse quoted-printable characters since that requires a byte 
> string object.
>
>
>
> The string read from the email body is u'some email body' and not b'some email 
> body'.
>
>
> However the same version of jython on windows gives b'some email body' as a 
> proper str object.
>
> Any help appreciated, guys; I've been stuck on this for 3 days.
>
>
> On Thu, Jul 7, 2022 at 12:46 PM Sushant Sawant  
> wrote:
>>
>> Hi all,
>>
>> Use case is to read email and extract body and other meta data and save it 
>> in mongo.
>>
>> msg.get_payload()
>>
>> when I execute above line m getting, "2022=\n 15:23" in response. Quotable 
>> printable.
>>
>> msg.get_payload(decode=True)
>>
>> when I execute above line m getting, "2022 15:23" in response. Quotable 
>> printable is removed.This is as expected and works locally, when I pass eml 
>> file. I am using the later one, "decode=True" on server but it is not 
>> decoding quotable printable.
>>
>> quopri.decodestring(body)
>>
>> Then I tried the above, but it is still not decoding as expected. Here is the 
>> entire script I'm using locally, running on Jython 2.7.2:
>>
>> import email
>> import quopri
>> msg = email.message_from_file(open("some_eml.eml"))
>> body = ""
>> if msg.is_multipart():
>> for part in msg.walk():
>> ctype = part.get_content_type()
>> cdispo = str(part.get('Content-Disposition'))
>> if (ctype == 'text/plain' or ctype == 'text/html') and 'attachment' 
>> not in cdispo:
>> body = part.get_content()  # decode
>> print(body)
>> break
>> else:
>> body_byte = msg.get_payload()
>> print(repr(body_byte))
>> body = body_byte.decode("utf-8", 'ignore')
>> print(repr(body))
>> utf = quopri.decodestring(body)
>> text = utf.decode('utf-8', errors='replace')
>> print(repr(text))
>> print(text)
>>
>> One observation is that it is behaving like the old script. I have restarted 
>> the cluster. Also tried the following settings, but they didn't help:
>> nifi.flowcontroller.autoResumeState=false
>> nifi.swap.manager.implementation=org.apache.nifi.controller.FileSystemSwapManager
>> nifi.queue.swap.threshold=2
>> nifi.swap.in.period=5 sec
>> nifi.swap.in.threads=1
>> nifi.swap.out.period=5 sec
>> nifi.swap.out.threads=4
>> Any help appreciated. Also created a question here:
>> https://community.cloudera.com/t5/Support-Questions/ExecuteScript-python-2-7-not-working-as-expected-on-server/td-p/346941
>>


Re: Most memory-efficient way in NiFi to fetch an entire RDBMS table?

2022-06-22 Thread Mike Thomsen
Thanks, Matt.

On Wed, Jun 22, 2022 at 10:46 AM Matt Burgess  wrote:
>
> Mike,
>
> I recommend QueryDatabaseTableRecord with judicious choices for the
> following properties:
>
> Fetch Size: This should be tuned to return the most number of rows
> without causing network issues such as timeouts. Can be set to the
> same value as Max Rows Per Flow File ensuring one fetch per outgoing
> FlowFile
> Max Rows Per Flow File: This should be set to a reasonable number of
> rows per FlowFile, maybe 100K or even 1M if that doesn't cause issues
> (see above)
> Output Batch Size: This is the key to doing full selects on huge
> tables, as it allows FlowFiles to be committed to the session and
> passed downstream while the rest of the fetch is being processed. In
> your case if you set Max Rows to 100K then this could be 10, or if you
> set it to 1M it could be 1. Note that with this property set, the
> maxvalue.* and fragment.count attributes will not be set on these
> FlowFiles, so you can't merge them.  I believe the maxvalue state will
> still be updated even if this property is used, so it should turn into
> an incremental fetch after the first full fetch is complete.
>
> Regards,
> Matt
>
> On Wed, Jun 22, 2022 at 10:00 AM Mike Thomsen  wrote:
> >
> > We have a table with 68M records that will blow up to over 250M soon,
> > and need to do a full table fetch on it. What's the best practice for
> > efficiently doing a partial or full select on it?
> >
> > Thanks,
> >
> > Mike


Most memory-efficient way in NiFi to fetch an entire RDBMS table?

2022-06-22 Thread Mike Thomsen
We have a table with 68M records that will blow up to over 250M soon,
and need to do a full table fetch on it. What's the best practice for
efficiently doing a partial or full select on it?

Thanks,

Mike


Re: Delta Lake table support in NiFi

2022-06-15 Thread Mike Thomsen
I looked into the feasibility of doing a direct integration and found
it to not be a really good fit because the other "integrations" I
found with Delta involved running Spark locally in the background
IIRC. Maybe that's changed, but the impression I got was that Delta is
more or less tied to how Spark works and doesn't fit in well with
different systems like NiFi (or Hive, as I think Hive was their first
"integration" attempt).

On Wed, Jun 15, 2022 at 12:42 PM scott  wrote:
>
> Hi community,
> I was wondering if there is an official effort to add delta table format ( 
> delta.io ) into NiFi? I saw there were some discussions a few years back, but 
> not sure what came of that. Not looking for a workaround where I have to use 
> spark to do the actual read/write into the delta table, just pure NiFi( Java 
> ) solution.
>
> Thanks,
> Scott


Re: Round robin load balancing eventually stops using all nodes

2022-04-01 Thread Mike Thomsen
I think I figured out how to get around this: partition-by-attribute
using UUID. About 10 minutes ago, I was down to 3/5 nodes on my
cluster. Switched the queues to that strategy, and the 3 full nodes
started sending work to the other two nodes without a restart.

On Fri, Apr 1, 2022 at 7:44 AM Mike Thomsen  wrote:
>
> I think I forgot to mention early on that we're using embedded
> ZooKeeper. Could that be a factor in this behavior?
>
> Thanks,
>
> Mike
>
> On Fri, Apr 1, 2022 at 7:28 AM Mike Thomsen  wrote:
> >
> > When we talk about "slower nodes" here, are we referring to nodes that
> > are bogged down by data but of the same size as the rest of the
> > cluster or are we talking about a heterogeneous cluster?
> >
> > On Mon, Sep 27, 2021 at 12:07 PM Joe Witt  wrote:
> > >
> > > Ryan,
> > >
> > > Regarding NIFI-9236 the JIRA captures it well but sounds like there is
> > > now a better understanding of how it works and what options exist to
> > > better view details.
> > >
> > > Regarding Load Balancing: NIFI-7081 is largely about the scenario
> > > whereby in load balancing cases nodes which are slower effectively set
> > > the rate the whole cluster can sustain because we don't have a fluid
> > > load balancing strategy which we should.  Such a strategy would allow
> > > for the fastest nodes to always take the most data.  We just need to
> > > do that work.  No ETA.
> > >
> > > Thanks
> > >
> > > On Tue, Sep 21, 2021 at 2:18 PM Ryan Hendrickson
> > >  wrote:
> > > >
> > > > Joe - We're testing some scenarios.  Andrew captured some confusing 
> > > > behavior in the UI when enabling and disabling load balancing on a 
> > > > relationship: "Update UI for Clustered Connections" -- 
> > > > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-9236
> > > >
> > > > Question - When a FlowFile is Load Balanced from one node to another, 
> > > > is the entire Content Claim load balanced?  Or just the small portion 
> > > > necessary?
> > > >
> > > > Mike -
> > > > We found two tickets that are in the ballpark:
> > > >
> > > > 1.  Improve handling of Load Balanced Connections when one node is slow 
> > > >   --https://issues.apache.org/jira/browse/NIFI-7081
> > > > 2.  NiFi FlowFiles stuck in queue when using Single Node load balance 
> > > > strategy   --https://issues.apache.org/jira/browse/NIFI-8970
> > > >
> > > > From @Simon comment - we know we've seen underperforming nodes in a 
> > > > cluster before.  We're discussing @Simon's comment is applicable to the 
> > > > issue we're seeing
> > > >   > "The one thing I can think of is the scenario where one (or 
> > > > more) nodes are significantly slower than the other ones. In these 
> > > > cases it might happen then the nodes are “running behind” blocks the 
> > > > other nodes from balancing perspective."
> > > >
> > > > @Simon - I'd like to understand the "blocks other nodes from balancing 
> > > > perspective" better if you have additional information.  We're trying 
> > > > to replicate this scenario.
> > > >
> > > > Thanks,
> > > > Ryan
> > > >
> > > > On Sat, Sep 18, 2021 at 3:45 PM Mike Thomsen  
> > > > wrote:
> > > >>
> > > >> > there is a ticket to overcome this (there is no ETA),
> > > >>
> > > >> Do you know what the Jira # is?
> > > >>
> > > >> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  
> > > >> wrote:
> > > >> >
> > > >> > Hi Mike,
> > > >> >
> > > >> > I did a quick check on the round robin balancing and based on what I 
> > > >> > found the reason for the issue must lie somewhere else, not directly 
> > > >> > within it. The one thing I can think of is the scenario where one 
> > > >> > (or more) nodes are significantly slower than the other ones. In 
> > > >> > these cases it might happen then the nodes are “running behind” 
> > > >> > blocks the other nodes from balancing perspective.
> > > >> >
> > > >> > Based on what you wrote this is a possible reason and there is a 
> > > >> > ticket to overcome this (there is no ETA), but other details might 
> > > >> > shed light to a different root cause.
> > > >> >
> > > >> > Regards,
> > > >> > Bence
> > > >> >
> > > >> >
> > > >> >
> > > >> > > On 2021. Sep 3., at 14:13, Mike Thomsen  
> > > >> > > wrote:
> > > >> > >
> > > >> > > We have a 5 node cluster, and sometimes I've noticed that round 
> > > >> > > robin
> > > >> > > load balancing stops sending flowfiles to two of them, and 
> > > >> > > sometimes
> > > >> > > toward the end of the data processing can get as low as a single 
> > > >> > > node.
> > > >> > > Has anyone seen similar behavior?
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Mike
> > > >> >


Re: Round robin load balancing eventually stops using all nodes

2022-04-01 Thread Mike Thomsen
I think I forgot to mention early on that we're using embedded
ZooKeeper. Could that be a factor in this behavior?

Thanks,

Mike

On Fri, Apr 1, 2022 at 7:28 AM Mike Thomsen  wrote:
>
> When we talk about "slower nodes" here, are we referring to nodes that
> are bogged down by data but of the same size as the rest of the
> cluster or are we talking about a heterogeneous cluster?
>
> On Mon, Sep 27, 2021 at 12:07 PM Joe Witt  wrote:
> >
> > Ryan,
> >
> > Regarding NIFI-9236 the JIRA captures it well but sounds like there is
> > now a better understanding of how it works and what options exist to
> > better view details.
> >
> > Regarding Load Balancing: NIFI-7081 is largely about the scenario
> > whereby in load balancing cases nodes which are slower effectively set
> > the rate the whole cluster can sustain because we don't have a fluid
> > load balancing strategy which we should.  Such a strategy would allow
> > for the fastest nodes to always take the most data.  We just need to
> > do that work.  No ETA.
> >
> > Thanks
> >
> > On Tue, Sep 21, 2021 at 2:18 PM Ryan Hendrickson
> >  wrote:
> > >
> > > Joe - We're testing some scenarios.  Andrew captured some confusing 
> > > behavior in the UI when enabling and disabling load balancing on a 
> > > relationship: "Update UI for Clustered Connections" -- 
> > > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-9236
> > >
> > > Question - When a FlowFile is Load Balanced from one node to another, is 
> > > the entire Content Claim load balanced?  Or just the small portion 
> > > necessary?
> > >
> > > Mike -
> > > We found two tickets that are in the ballpark:
> > >
> > > 1.  Improve handling of Load Balanced Connections when one node is slow   
> > > --https://issues.apache.org/jira/browse/NIFI-7081
> > > 2.  NiFi FlowFiles stuck in queue when using Single Node load balance 
> > > strategy   --https://issues.apache.org/jira/browse/NIFI-8970
> > >
> > > From @Simon comment - we know we've seen underperforming nodes in a 
> > > cluster before.  We're discussing @Simon's comment is applicable to the 
> > > issue we're seeing
> > >   > "The one thing I can think of is the scenario where one (or 
> > > more) nodes are significantly slower than the other ones. In these cases 
> > > it might happen then the nodes are “running behind” blocks the other 
> > > nodes from balancing perspective."
> > >
> > > @Simon - I'd like to understand the "blocks other nodes from balancing 
> > > perspective" better if you have additional information.  We're trying to 
> > > replicate this scenario.
> > >
> > > Thanks,
> > > Ryan
> > >
> > > On Sat, Sep 18, 2021 at 3:45 PM Mike Thomsen  
> > > wrote:
> > >>
> > >> > there is a ticket to overcome this (there is no ETA),
> > >>
> > >> Do you know what the Jira # is?
> > >>
> > >> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  
> > >> wrote:
> > >> >
> > >> > Hi Mike,
> > >> >
> > >> > I did a quick check on the round robin balancing and based on what I 
> > >> > found the reason for the issue must lie somewhere else, not directly 
> > >> > within it. The one thing I can think of is the scenario where one (or 
> > >> > more) nodes are significantly slower than the other ones. In these 
> > >> > cases it might happen then the nodes are “running behind” blocks the 
> > >> > other nodes from balancing perspective.
> > >> >
> > >> > Based on what you wrote this is a possible reason and there is a 
> > >> > ticket to overcome this (there is no ETA), but other details might 
> > >> > shed light to a different root cause.
> > >> >
> > >> > Regards,
> > >> > Bence
> > >> >
> > >> >
> > >> >
> > >> > > On 2021. Sep 3., at 14:13, Mike Thomsen  
> > >> > > wrote:
> > >> > >
> > >> > > We have a 5 node cluster, and sometimes I've noticed that round robin
> > >> > > load balancing stops sending flowfiles to two of them, and sometimes
> > >> > > toward the end of the data processing can get as low as a single 
> > >> > > node.
> > >> > > Has anyone seen similar behavior?
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Mike
> > >> >


Re: Round robin load balancing eventually stops using all nodes

2022-04-01 Thread Mike Thomsen
When we talk about "slower nodes" here, are we referring to nodes that
are bogged down by data but of the same size as the rest of the
cluster or are we talking about a heterogeneous cluster?

On Mon, Sep 27, 2021 at 12:07 PM Joe Witt  wrote:
>
> Ryan,
>
> Regarding NIFI-9236 the JIRA captures it well but sounds like there is
> now a better understanding of how it works and what options exist to
> better view details.
>
> Regarding Load Balancing: NIFI-7081 is largely about the scenario
> whereby in load balancing cases nodes which are slower effectively set
> the rate the whole cluster can sustain because we don't have a fluid
> load balancing strategy which we should.  Such a strategy would allow
> for the fastest nodes to always take the most data.  We just need to
> do that work.  No ETA.
>
> Thanks
>
> On Tue, Sep 21, 2021 at 2:18 PM Ryan Hendrickson
>  wrote:
> >
> > Joe - We're testing some scenarios.  Andrew captured some confusing 
> > behavior in the UI when enabling and disabling load balancing on a 
> > relationship: "Update UI for Clustered Connections" -- 
> > https://issues.apache.org/jira/projects/NIFI/issues/NIFI-9236
> >
> > Question - When a FlowFile is Load Balanced from one node to another, is 
> > the entire Content Claim load balanced?  Or just the small portion 
> > necessary?
> >
> > Mike -
> > We found two tickets that are in the ballpark:
> >
> > 1.  Improve handling of Load Balanced Connections when one node is slow   
> > --https://issues.apache.org/jira/browse/NIFI-7081
> > 2.  NiFi FlowFiles stuck in queue when using Single Node load balance 
> > strategy   --https://issues.apache.org/jira/browse/NIFI-8970
> >
> > From @Simon comment - we know we've seen underperforming nodes in a cluster 
> > before.  We're discussing @Simon's comment is applicable to the issue we're 
> > seeing
> >   > "The one thing I can think of is the scenario where one (or 
> > more) nodes are significantly slower than the other ones. In these cases it 
> > might happen then the nodes are “running behind” blocks the other nodes 
> > from balancing perspective."
> >
> > @Simon - I'd like to understand the "blocks other nodes from balancing 
> > perspective" better if you have additional information.  We're trying to 
> > replicate this scenario.
> >
> > Thanks,
> > Ryan
> >
> > On Sat, Sep 18, 2021 at 3:45 PM Mike Thomsen  wrote:
> >>
> >> > there is a ticket to overcome this (there is no ETA),
> >>
> >> Do you know what the Jira # is?
> >>
> >> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  
> >> wrote:
> >> >
> >> > Hi Mike,
> >> >
> >> > I did a quick check on the round robin balancing and based on what I 
> >> > found the reason for the issue must lie somewhere else, not directly 
> >> > within it. The one thing I can think of is the scenario where one (or 
> >> > more) nodes are significantly slower than the other ones. In these cases 
> >> > it might happen then the nodes are “running behind” blocks the other 
> >> > nodes from balancing perspective.
> >> >
> >> > Based on what you wrote this is a possible reason and there is a ticket 
> >> > to overcome this (there is no ETA), but other details might shed light 
> >> > to a different root cause.
> >> >
> >> > Regards,
> >> > Bence
> >> >
> >> >
> >> >
> >> > > On 2021. Sep 3., at 14:13, Mike Thomsen  wrote:
> >> > >
> >> > > We have a 5 node cluster, and sometimes I've noticed that round robin
> >> > > load balancing stops sending flowfiles to two of them, and sometimes
> >> > > toward the end of the data processing can get as low as a single node.
> >> > > Has anyone seen similar behavior?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Mike
> >> >


Re: Insufficient Permissions for Expression Language

2022-03-30 Thread Mike Thomsen
This looks specifically like there is some sort of enterprise setup
that is scanning for "malicious code" and not liking what it sees.
Talk to your IT folks about what sort of network security packages are
installed that might be jumping in between you and NiFi here.

On Wed, Mar 30, 2022 at 3:54 PM Mark Payne  wrote:
>
> Hi Stanley,
>
> That error message is not coming from NiFi. I would guess that you have some 
> sort of load balancer, proxy, etc. between you and the NiFi instance? WOuld 
> recommend looking at that to see if you can determine what’s happening there.
>
> Thanks
> -Mark
>
>
> On Mar 30, 2022, at 3:32 PM, Martin, Stanley L 
>  wrote:
>
>
>
> I have an instance of NiFi (v. 1.15.3) running in Cloud Foundry, and several 
> weeks ago I started getting a message that I have Insufficient Permissions 
> when I try to add or modify a processor property that contains Expression 
> Language.  The message I get is:
>
> Request Rejected. The requested URL was rejected. Please consult with your 
> administrator. Your support ID is: 14975944297778607952 [Go Back]
>
> Does anyone have an idea what this means and how I can fix it?
>
> Thanks,
> Stanley
>
>


Re: VolatileContentRepository removal

2022-03-30 Thread Mike Thomsen
We've been moving away from supporting it for a while, and I think it
comes down to a lot of both factors when you consider the time
involved in getting good patches and reviewing them. That said, until
1.17 is released, I think there's room for community members like you
and your team to work with us on fixing the gaps that made a strong
case for removing it.

I think I saw in your ticket that you provided patches through Jira.
My recommendation would be to do a feature branch that reverts the
removal, applies your patches and submit it as a PR on GitHub. Then
request a review. Obviously, there's no guarantees there because it's
based on folks' time and energy to do a review, but that would be the
right process at least to move your request forward.

In the long run, I think it would be a lot better for you to share
your use case with us and to see if there's a better route ahead for
your team and NiFi. Sounds like an interesting use case, so it would
be good to get those requirements on the table since most users aren't
operating with those constraints.

Thanks,

Mike

On Tue, Mar 29, 2022 at 12:20 PM Matthieu Ré  wrote:
>
> Hi everyone,
>
> We wanted to talk about this ticket 
> https://issues.apache.org/jira/browse/NIFI-8760 and the 
> VolatileContentRepository... I understand that not many of us still use 
> this repository, but in our use case, with a very limited cloud environment 
> with strict IOPS regulations, it fitted perfectly and we managed several TB 
> of data per day efficiently.
>
> We tested other repositories, even a FileSystemContentRepo with a RAM-based 
> disk, but that did not fit the case since we experienced numerous OOMs with the 
> same amount of RAM mounted.
>
> I provided a patch to fix it, that should be applied after 1.13.0 and a 
> refactor of Claims handling, waiting for a discussion about it. Now I read 
> that it should disappear in 1.17.0 :(
>
> Is it due to a technical limitation for further features ? Or is it  too 
> costly to maintain it ?
>
> Thanks! Regards,
> Matthieu


Re: QueryRecord with Union type

2022-03-18 Thread Mike Thomsen
{"name":"flag_s","type":["int","boolean"]}

We have a lot of type massaging baked into the Record API. If the int
version is meant to be used as a boolean equivalent (0 = false,
anything else is truthy) then this is something that probably already
is or should be covered by that type massaging.

On Fri, Mar 18, 2022 at 5:50 AM  wrote:
>
> Mark,
>
>
>
> Thank you for your response. I thought that was probably the case, but I 
> tried a cast and it did not work. I got this error.
>
>
>
> Query:
>
> select *
>
> from flowfile
>
> where cast(flag_s as boolean) = true
>
>
>
> Error:
>
> org.apache.calcite.sql.validate.SqlValidatorException: Cast function cannot 
> convert value of type JavaType(class java.lang.Object) to type BOOLEAN
>
>
>
> By taking the union out of the input schema I could get the query to work, 
> but I did find myself getting tangled up in managing various schemas so I am 
> trying to use infer/inherit read/write services instead. I have inherited a 
> very complex flow from a team that have long departed and am looking to 
> simplify it to improve performance and maintainability. I need to convert 
> from CSV/TSV to JSON, normalise fields, filter unwanted records, enrich with 
> more JSON and finally publish to a customer defined schema, so I do need a 
> few steps along the way. I am exploring each step in order to validate my 
> redesign so I take your point about minimising the number of processes and 
> will look again at combining steps in the query process, although I am also a 
> fan of the JOLT transform as I have used that often in previous projects.
>
>
>
> Regards
>
> Steve Hindmarch
>
>
>
> From: Mark Payne 
> Sent: 17 March 2022 14:17
> To: users 
> Subject: Re: QueryRecord with Union type
>
>
>
> Steve,
>
>
>
> Because your schema has a union, the SQL engine doesn’t really know how to 
> interpret the data. So it interprets it as a “Java Object.” Essentially,
>
> it could be anything. But you can’t compare just anything to true - you need 
> to compare a boolean to true. So you need to tell the SQL engine that the
>
> value you’re looking at is, in fact, a boolean.
>
>
>
> You can do that with a simple CAST() function in your SQL:
>
>
>
> SELECT *
>
> FROM FLOWFILE
>
> WHERE CAST(flag_s AS BOOLEAN) = true
>
>
>
> That should give you what you’re looking for.
>
>
>
> Also worth noting - you mentioned that you’re using ConvertRecord and 
> UpdateRecord before QueryRecord.
>
> 99% of the time, you should not be using ConvertRecord in conjunction with 
> any other Record processor. Because the Record processors like UpdateRecord
>
> allow you to use any Record Reader, it doesn’t make sense to convert the data 
> first using ConvertRecord - it’s just extra overhead.
>
> And, in fact, you may be able to eliminate the UpdateRecord as well, and 
> just use the SQL within QueryRecord to perform the transformation needed on 
> the fly,
>
> rather than having another step to update the data, which requires reading 
> the data, parsing it, updating it, serializing the data, writing the data. 
> This may not
>
> be possible, depends on what you’re updating. But QueryRecord does support 
> RecordPath expressions so it’s worth considering.
>
>
>
> Thanks
>
> -Mark
>
>
>
>
>
>
>
> On Mar 15, 2022, at 8:35 AM, stephen.hindma...@bt.com wrote:
>
>
>
> I am having a play with QueryRecord to do some filtering but I have run 
> across this problem. I have a schema for my records which includes a union 
> type, so the relevant part of the schema is
>
>
>
> {
>
>   "type":"record",
>
>   "namespace":"blah",
>
>   "name":"SimpleTraffic",
>
>   "fields":[
>
> {"name":"src_address","type":"string"},
>
> {"name":"flag_s","type":["int","boolean"]}
>
>   ]
>
> }
>
>
>
> This is because I am processing CSV records that look this, where 1 is true 
> and 0 is false.
>
>
>
> 192.168.0.1,1
>
>
>
> Into JSON that looks like this, using a ConvertRecord and an Update Record.
>
>
>
> {"src_address":"192.168.0.1","flag_s":true}
>
>
>
> Then I create a QueryRecord so I can filter out the cases where the flag is 
> false. So I use this query.
>
>
>
> select * from flowfile where flag_s = true
>
>
>
> But I get this error
>
>
>
> org.apache.calcite.sql.validate.SqlValidatorException: Cannot apply '=' to 
> arguments of type ' = '
>
>
>
> Is this because the type is a Union type and the Calcite processor cannot 
> work out which subtype it should be? Can I do anything to persuade the query 
> to use an operator or a function on this field to make it usable? I have 
> tried casting to Boolean or Char but no success. Or do I need to use two 
> separate “before” and “after” schemas to eliminate the union?
>
>
>
> Regards
>
>
>
> Steve Hindmarch
>
>


Re: Are counters disabled when running a cluster?

2022-03-11 Thread Mike Thomsen
That would do it... I was under the impression that I had all rights
on this cluster, but apparently I don't.

On Fri, Mar 11, 2022 at 9:08 AM James McMahon  wrote:
>
> I'm running in clustered config and I see Counters from the hamburger menu in 
> the upper right. What do you have set for your "Access counters" policy?
>
> On Fri, Mar 11, 2022 at 9:01 AM Mike Thomsen  wrote:
>>
>> I'm running 1.13.2, and the Counters menu item is the only one
>> disabled on my cluster. Is that normal?
>>
>> Thanks,
>>
>> Mike


Re: Records - Best Approach to Enrich Record From Cache

2022-03-07 Thread Mike Thomsen
I skimmed over the code in the Redis DMC client, and did not see any
place where we could do a MGET there. Not sure if that's relevant to
Nick's use case, but it would be relevant to that general pattern
going forward. It wouldn't be hard to add a bulk get method to the DMC
interface and provide a default interface that just loops and does
multiple get operations and stacks them together. Then the Redis
version could do a MGET and stack them together.

That said, AFAIK we'd need to create a new enrichment process or
extend something like ScriptedTransformRecord to integrate with a DMC.

I have the time to work on this, but would like to hear from
committers and users before I start banging out the code to make sure
I'm not missing something.
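
For reference, the ordered-list-with-nulls behavior of MGET that Steve describes
below looks like this from redis-py (just a sketch; connection details and key
names are invented):

import redis

r = redis.Redis(host="localhost", port=6379)   # hypothetical connection
# MGET returns values in the same order as the requested keys,
# with None wherever a key does not exist.
values = r.mget(["user:1", "user:2", "user:3"])
# e.g. [b'{"name": "alice"}', None, b'{"name": "carol"}']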

On Mon, Mar 7, 2022 at 7:18 AM  wrote:
>
> Redis does allow multiple gets in the one hit with MGET. If you search for 
> all keys the response is an ordered list of matching values, with null in 
> place where there is no match.
>
>
>
> Steve Hindmarch
>
>
>
> From: Nick Lange 
> Sent: 07 March 2022 04:46
> To: users@nifi.apache.org
> Subject: Records - Best Approach to Enrich Record From Cache
>
>
>
> HI all -
>
>  I have a record set of objects that each need enrichment of about 10/20 
> fields of data from a Redis Cache. In a perfect world, I'd hit the cache once 
> and return a json blob for further extraction  - ideally in a single hop.  I 
> don't see an easy way to do this with the record language, but perhaps I've 
> missed something.
>
>
>
> Lacking any better sophistication, I'm currently doing this brute-force with 
> 10-20 hits to the cache for each field. I'm hoping that the mailing list has 
> better suggestions.
>
>
>
> Thank you
>
> Nick
>
>


Set single user credentials on Windows

2022-02-18 Thread Mike Thomsen
Is there a Windows equivalent to ./nifi.sh set-single-user-credentials
for automating the setup of the admin user on a vanilla installation?

Thanks,

Mike


Re: Does NiFi support Data Lake or Streaming Platform ?

2022-01-07 Thread Mike Thomsen
Yes and no. You can do the majority of the ingestion work from NiFi by
having NiFi write your records as Parquet files and upload them to S3,
followed by a small Spark job to integrate them into your existing
Delta Lake. I've done a demo on that before (can't share the code) and
it was pretty easy to write. NiFi does most of the work of converting
and cleaning up the input; the Spark job would just read the Parquet
files and append them to your Delta Lake.

On Fri, Jan 7, 2022 at 1:41 AM Hao Wang  wrote:
>
> Dear NiFi devs :
>
> I'm new to NiFi, and I want to know if NiFi supports Data Lake or Streaming 
> Platform ?
>
> Bravo !
> Hao Wang


Re: Round robin load balancing eventually stops using all nodes

2021-09-18 Thread Mike Thomsen
> there is a ticket to overcome this (there is no ETA),

Do you know what the Jira # is?

On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  wrote:
>
> Hi Mike,
>
> I did a quick check on the round robin balancing and, based on what I found, 
> the reason for the issue must lie somewhere else, not directly within it. The 
> one thing I can think of is the scenario where one (or more) nodes are 
> significantly slower than the other ones. In these cases it might happen that 
> the nodes which are “running behind” block the other nodes from a balancing 
> perspective.
>
> Based on what you wrote this is a possible reason and there is a ticket to 
> overcome this (there is no ETA), but other details might shed light to a 
> different root cause.
>
> Regards,
> Bence
>
>
>
> > On 2021. Sep 3., at 14:13, Mike Thomsen  wrote:
> >
> > We have a 5 node cluster, and sometimes I've noticed that round robin
> > load balancing stops sending flowfiles to two of them, and sometimes
> > toward the end of the data processing can get as low as a single node.
> > Has anyone seen similar behavior?
> >
> > Thanks,
> >
> > Mike
>


Re: Round robin load balancing eventually stops using all nodes

2021-09-10 Thread Mike Thomsen
The use case where we most often run into this problem involves
extracting content from tarballs of varying sizes that are fairly
large. These tarballs vary in size from 80GB to the better part of
500GB and contain a ton of 250k-1MB files in them; about 1.5M files
per tarball is the norm.

(I am aware that this is a really bad way to get data for NiFi, but
the upstream source has absolutely refused to change their export
methodology)

On Tue, Sep 7, 2021 at 5:03 PM Joe Witt  wrote:
>
> Ryan
>
> If this is so easily replicated for you it should be trivially found and 
> fixed most likely.
>
> Please share, for each node in your cluster, both a thread dump and heap dump 
> within 30 mins of startup and again after 24 hours.
>
> This will allow us to see the delta and if there appears to be any sort of 
> leak.   If you cannot share these then you can do that analysis and share the 
> results.
>
> Nobody should have to restart nodes to keep things healthy.
>
> Joe
>
> On Tue, Sep 7, 2021 at 12:58 PM Ryan Hendrickson 
>  wrote:
>>
>> We have a daily cron job that restarts our nifi cluster to keep it in a good 
>> state.
>>
>> On Mon, Sep 6, 2021 at 6:11 PM Mike Thomsen  wrote:
>>>
>>> >  there is a ticket to overcome this (there is no ETA), but other details 
>>> > might shed light to a different root cause.
>>>
>>> Good to know I'm not crazy, and it's in the TODO. Until then, it seems
>>> fixable by bouncing the box.
>>>
>>> On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  wrote:
>>> >
>>> > Hi Mike,
>>> >
>>> > I did a quick check on the round robin balancing and based on what I 
>>> > found the reason for the issue must lie somewhere else, not directly 
>>> > within it. The one thing I can think of is the scenario where one (or 
>>> > more) nodes are significantly slower than the other ones. In these cases 
>>> > it might happen then the nodes are “running behind” blocks the other 
>>> > nodes from balancing perspective.
>>> >
>>> > Based on what you wrote this is a possible reason and there is a ticket 
>>> > to overcome this (there is no ETA), but other details might shed light to 
>>> > a different root cause.
>>> >
>>> > Regards,
>>> > Bence
>>> >
>>> >
>>> >
>>> > > On 2021. Sep 3., at 14:13, Mike Thomsen  wrote:
>>> > >
>>> > > We have a 5 node cluster, and sometimes I've noticed that round robin
>>> > > load balancing stops sending flowfiles to two of them, and sometimes
>>> > > toward the end of the data processing can get as low as a single node.
>>> > > Has anyone seen similar behavior?
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Mike
>>> >


Re: Round robin load balancing eventually stops using all nodes

2021-09-06 Thread Mike Thomsen
>  there is a ticket to overcome this (there is no ETA), but other details 
> might shed light to a different root cause.

Good to know I'm not crazy, and it's in the TODO. Until then, it seems
fixable by bouncing the box.

On Mon, Sep 6, 2021 at 7:14 AM Simon Bence  wrote:
>
> Hi Mike,
>
> I did a quick check on the round robin balancing and, based on what I found, 
> the reason for the issue must lie somewhere else, not directly within it. The 
> one thing I can think of is the scenario where one (or more) nodes are 
> significantly slower than the other ones. In these cases it might happen that 
> the nodes which are “running behind” block the other nodes from a balancing 
> perspective.
>
> Based on what you wrote this is a possible reason and there is a ticket to 
> overcome this (there is no ETA), but other details might shed light to a 
> different root cause.
>
> Regards,
> Bence
>
>
>
> > On 2021. Sep 3., at 14:13, Mike Thomsen  wrote:
> >
> > We have a 5 node cluster, and sometimes I've noticed that round robin
> > load balancing stops sending flowfiles to two of them, and sometimes
> > toward the end of the data processing can get as low as a single node.
> > Has anyone seen similar behavior?
> >
> > Thanks,
> >
> > Mike
>


Round robin load balancing eventually stops using all nodes

2021-09-03 Thread Mike Thomsen
We have a 5 node cluster, and sometimes I've noticed that round robin
load balancing stops sending flowfiles to two of them, and sometimes
toward the end of the data processing can get as low as a single node.
Has anyone seen similar behavior?

Thanks,

Mike


Re: Need help to create avro schema for arrays with tags

2021-08-10 Thread Mike Thomsen
You'll probably want to go more like this:

{
  "name": "objectDetails",
  "type": {
"type": "array",
"items": {
  "name": "AdditionalInfoRecord",
  "type": "record",
  "fields": [
{ "name": "name", "type": "string" },
{ "name": "value", "type": "string" }
  ]
}
  }
}

That would go for a JSON like this:

"objectDetails": [
  { "name": "tag1", "value": "value1" }, //etc.
]

On Wed, Jul 28, 2021 at 8:17 AM Jens M. Kofoed  wrote:
>
> Dear community
>
> I'm struggling with transforming some xml data into a json format using an 
> avro schema.
> The data which I can't get to work looks something like this:
> <object>
> <objectDetails>
> <tag1>value1</tag1>
> <tag2>value2</tag2>
> <tag3>value3</tag3>
> </objectDetails>
> <objectIdentification>
> <objectId>1</objectId>
> <objectType>objType</objectType>
> </objectIdentification>
> </object>
>
> If I set the type for additionalInfo to array, I only get the values. I 
> tried to set the array items to a record, but I can't get the tag names.
>
> My goal is to get a json like this:
> "object" : {
> "objectDetails" : [
> { "additionalInfo" : "tag1", "value":"value1"}
> { "additionalInfo" : "tag2", "value":"value2"}
> { "additionalInfo" : "tag3", "value":"value3"}
> ],
> "objectIdentification" : {
>   "objectId" : 1,
>   "objectType" : " objType  "
> }
>   }
>
> or
>
> "object" : {
> "objectDetails" : {
> "additionalInfo" : [
>  {"name":"tag1", "value":"value1"},
>  { "name":"tag2", "value":"value2"},
>  { "name":"tag3", "value":"value3"}
> ]
> },
> "objectIdentification" : {
>   "objectId" : 1,
>   "objectType" : " objType  "
> }
>   }
>
> Kind regards
> Jens M. Kofoed
>
>


Re: Is a prompt for a user cert normal on startup?

2021-08-04 Thread Mike Thomsen
Yeah, that's why I was scratching my head. I didn't install a user
cert in my keychain on macOS and was wondering how a new self-signed
cert my browser didn't trust was getting me prompted for a cert.

On Tue, Aug 3, 2021 at 3:39 PM David Handermann
 wrote:
>
> Hi Mike,
>
> With the default configuration using HTTPS in 1.14.0, the browser will prompt 
> for a certificate if one is available. The NiFi Jetty server is configured to 
> request a certificate, but it is not required. That is the reason for the 
> browser prompt, and the reason canceling the request then prompts for 
> username and password authentication.
>
> Regards,
> David Handermann
>
> On Tue, Aug 3, 2021 at 1:59 PM Mike Thomsen  wrote:
>>
>> I built a fresh copy of 1.15.0-SNAPSHOT and got prompted for a cert
>> when I hit the web console. It ultimately didn't block me from logging
>> in with u/p. Is that normal behavior? I ask because my main laptop is
>> a corporate one that does some funny things with our security
>> settings.
>>
>> Thanks,
>>
>> Mike


Is a prompt for a user cert normal on startup?

2021-08-03 Thread Mike Thomsen
I built a fresh copy of 1.15.0-SNAPSHOT and got prompted for a cert
when I hit the web console. It ultimately didn't block me from logging
in with u/p. Is that normal behavior? I ask because my main laptop is
a corporate one that does some funny things with our security
settings.

Thanks,

Mike


Re: Nifi throws an error when reading a large csv file

2021-04-14 Thread Mike Thomsen
I could be totally barking up the wrong tree, but I think this is our
clue: Requested array size exceeds VM limit

That means that something is causing the reader to try to allocate an
array with a number of entries greater than the VM allows.

Without seeing the schema, a sample of the CSV and a stacktrace it's
pretty hard to guess what's going on. For what it's worth, I've split
55GB JSON sets using a custom streaming JSON reader without a hiccup
on a NiFi instance with only 4-8GB of RAM allocated, so I'm fairly
confident we've got some quirky edge case here.

If you want to sanitize some inputs and share along with a schema that
might help.

On Wed, Apr 14, 2021 at 1:07 PM Vibhath Ileperuma
 wrote:
>
> Hi Chris,
>
> As you have mentioned, I am trying to split the large csv file in multiple 
> stages. But this error is thrown at the first stage even without creating a 
> single flow file.
> It seems like the issue is not with the processor, but with the CSV record 
> reader. This error is thrown while reading the csv file. I tried to write the 
> data in the large csv file into a kudu table using a putKudu processor with 
> the same CSV reader. Then also I got the same error message.
>
> Hi Otto,
>
> Only following information is available in log file related to the exception
>
> 2021-04-14 17:48:28,628 ERROR [Timer-Driven Process Thread-1] 
> o.a.nifi.processors.standard.SplitRecord 
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] 
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] failed to process 
> session due to java.lang.OutOfMemoryError: Requested array size exceeds VM 
> limit; Processor Administratively Yielded for 1 sec: 
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> 2021-04-14 17:48:28,628 WARN [Timer-Driven Process Thread-1] 
> o.a.n.controller.tasks.ConnectableTask Administratively Yielding 
> SplitRecord[id=c9a981db-0178-1000-363d-c767653a6f34] due to uncaught 
> Exception: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
>
>
>
> On Wed, Apr 14, 2021 at 7:47 PM Otto Fowler  wrote:
>>
>> What is the complete stack trace of that exception?
>>
>> On Apr 14, 2021, at 02:36, Vibhath Ileperuma  
>> wrote:
>>
>> Requested array size exceeds VM limit
>>
>>


Re: Groovy script

2021-02-24 Thread Mike Thomsen
If file_path is pointing to a folder as you said, it's going to check
for the folder's existence. The fact that it's failing to return true
there suggests that something is wrong with the path in the file_path
attribute.
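
If it helps to narrow it down, here's a minimal sketch (assuming the attribute
really is named file_path) that logs what the script actually resolves before
routing:

flowFile = session.get()
if (!flowFile) return

def filePath = flowFile.getAttribute('file_path')
def file = new File(filePath)

// log the resolved absolute path so you can compare it to what is on disk
log.info("file_path = '${filePath}', resolved = '${file.absolutePath}', exists = ${file.exists()}")

session.transfer(flowFile, file.exists() ? REL_FAILURE : REL_SUCCESS)

Also note: if file_path only points at the folder, you'd need to append the
actual file name before exists() will tell you anything useful.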

On Wed, Feb 24, 2021 at 11:47 AM Tomislav Novosel
 wrote:
>
> Hi guys,
>
>
>
> I want to check if file exists with this groovy script:
>
>
>
> flowfile = session.get()
> if(!flowfile) return
>
> file_path = flowfile.getAttribute('file_path')
> File file = new File(file_path)
>
> if(file.exists()){
> session.transfer(flowfile, REL_FAILURE)
> }
> else{
> session.transfer(flowfile, REL_SUCCESS)
> }
>
>
>
> and to route all files which exist to FAILURE relationship, but all of them 
> go to SUCCESS, file is for sure in the folder
>
> ‘file_path’, I checked.
>
>
>
> What am I doing wrong?
>
>
>
> Thanks,
>
>
>
> Tom


Re: CM/CI best practices

2021-01-22 Thread Mike Thomsen
Geoff,

Here's a blog post of mine that shows how to do unit testing against
Groovy scripts that you run in your flows:

https://mikethomsen.github.io/posts/2020/11/08/testing-executescript-modules-with-the-nifi-test-framework/

As far as repositories goes, the NiFi Registry is the best route for
doing CM work while also being able to easily transition between
environments.

My team just uses Chef to do repeatable deployments of NiFi and the
Registry to move between environments. We don't do automated testing.

Mike

On Thu, Jan 21, 2021 at 2:58 PM Greene (US), Geoffrey N
 wrote:
>
> I’m trying to figure out some CI/CM best practices.  I want to be able to 
> design a flow, test the flow on some test data, then distribute that exact 
> same configuration (definitely flows, probably services, and so on)  into 
> production.  I may have multiple engineers working in this environment, and I 
> want to be able to store my files in a repository, and be able to do standard 
> git merge/branches etc.  Of course, you don’t want your branch to merge to 
> master if it hasn’t passed test.  I have already scripted some simple python 
> tests that can start nifi, start a flow, and verify output, so I know that CI 
> CAN work.
>
> I may choose to go to a clustered solution, too, so I’d want to be able to 
> spin up additional cluster nodes if needed.
>
>
>
> So what is the recommended way to do this?   Here are some of the options 
> I’ve come up with:
>
>
>
> 1)  Have a dedicated nifi instance, and only CM the flows (using nifi 
> repository).  If I understand this correctly, this means that the 
> configuration of nifi itself, would not be CM’d. Im not clear on how services 
> would be handled, if a new flow requires an internal service.  I don’t like 
> this much, since it doesn’t seem terribly repeatable, but maybe its how the 
> overall system is designed.
>
> 2)  Configuration control EVERY file.  This means that as the database 
> changes while authoring a flow, new commits from a developer would be 
> required.  This seems troublesome, though, as merges would be difficult, and 
> it would be difficult to actually tell what changed.  Hopefully no flowfiles 
> would go into the repo.
>
> 3)  Configuration control SOME of the files (though not all of them) in 
> the nifi directory structure.  I’m not clear on which ones though.  Maybe 
> whole directories?  A guide would be helpful.
>
> 4)  Have one git repository housing the nifi repository (the flows).  
> Have another repository that houses the nifi software.  The repo containing 
> the flows would be updated frequently, the one containing the software would NOT 
> be updated as frequently.
>
> 5)  Don’t do CM at all.  It can’t be done.  Rely on backups only.
>
>
>
> I’m still struggling with how to maintain some of the custom groovy scripts 
> I’ve written too that are kept on disk.
>
>
>
> In any event how do others do this?  Are there any wikis/articles on this?
>
>
>
> Thanks for your thoughts
>
> -geoff


Re: files larger than queue size limit

2020-12-16 Thread Mike Thomsen
To add to that, you should compress the content before loading into S3
or you will be paying a lot more than you have to.

On Wed, Dec 16, 2020 at 6:49 AM Pierre Villard
 wrote:
>
> Yes it should work just fine. The relationship backpressure settings are just 
> soft limits: if backpressure is not enabled, then the upstream processor can 
> be triggered even if the processor generates a huge flow file that would 
> cause the backpressure to be enabled. The backpressure mechanism is only at 
> trigger time.
>
> Regarding memory, the record processors are processing data in a streaming 
> fashion, the data will never get fully loaded into memory.
>
> Generally speaking, NiFi is agnostic of the data size and can deal with any 
> kind of large/small files.
>
> Hope this helps,
> Pierre
>
>
> Le mer. 16 déc. 2020 à 06:39, naga satish  a écrit :
>>
>> My team designed a NiFi flow to handle CSV files of size around 15GB. But 
>> later we realised that files can be upto 500 GB. I set the queue size limit 
>> to 25GB. This is a one time data load to S3. I'm converting each CSV file to 
>> parquet in NiFi using a convert record processor. What happens in these 
>> situations? Can NiFi be able to handle this kind of scenario?
>>
>> FYI, my NiFi has 40 gigs of memory and 2TB of storage.
>>
>> Regards
>> Satish


Re: ExecuteScript Concurrent Tasks

2020-12-13 Thread Mike Thomsen
The concurrency setting doesn't give extra threads to the jython engine, if
that's what you mean. It sets the maximum threads that can run
ExecuteScript simultaneously. So all you'd get is two single threaded
scripts running if you bump it to two threads.

As to the groovy issue, absolutely. Groovy is night and day more performant
and supported than jython. The only python code my team uses with nifi is
edge cases where what we need can only be scripted in a few lines with
python 3 and it's worth the overhead of calling that with execute stream
command.

On Thu, Dec 10, 2020, 11:53 Noe Detore  wrote:

> Hello,
>
> Concurrent tasks increased using ExecuteScript or InvokeScriptedProcessor
> with python/jyphon to update content has no increased throughput. If I copy
> the processor and run the 2 in parallel the amount of data processed does
> not increase. Any explanation for this? Is there a system-wide setting for
> how much cpu is available to the Jython engine?
>
> Would refactoring into groovy improve throughput or is it best to create a
> custom processor?
>
> thank you
> Noe
>


How to debug 500 error in nifi-api /controller-services method

2020-12-10 Thread Mike Thomsen
I set the root logger to debug in logback.xml, and I'm still not
seeing any stacktraces in nifi-app.log.  Is there something else I
need to update?

Thanks,

Mike


Re: NIFI and Out of Memory Error

2020-12-03 Thread Mike Thomsen
One of my colleagues ran into a similar situation, and all that was
required to fix it was to make ReplaceText work line by line. When you
do that, you shouldn't run into any issues.
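
For reference, that's the Evaluation Mode property on ReplaceText, e.g.:

Evaluation Mode: Line-by-Line

With "Entire text" the whole flowfile content gets buffered on the heap, which
is typically where the OOM comes from on large files.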

On Thu, Dec 3, 2020 at 1:04 PM jgunvaldson  wrote:
>
> Just looking for an opinion
>
> Knowing (for one example) that ReplaceText Processor can be very memory 
> intensive with large files - we are finding it more and more common to wake 
> up to an Out of Memory error like the following
>
> 2020-12-03 15:07:21,748ZUTC ERROR [Timer-Driven Process Thread-31] 
> o.a.nifi.processors.standard.ReplaceText 
> ReplaceText[id=352afe80-4195-3f56-8798-aaf8be160581] 
> ReplaceText[id=352afe80-4195-3f56-8798-aaf8be160581] failed to process 
> session due to java.lang.OutOfMemoryError: Java heap space; Processor 
> Administratively Yielded for 1 sec: java.lang.OutOfMemoryError: Java heap 
> space
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.nifi.processors.standard.ReplaceText.onTrigger(ReplaceText.java:255)
> at 
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> at 
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1162)
>
>
> My question is this. Knowing that "When an OOME occurs in a JVM this can 
> cause the JVM to skip instructions. Skipping instructions can compromise the 
> integrity of the JVM memory without displaying errors. You can't always tell 
> from the outside if a JVM has compromised memory, the only safe thing to do 
> is restart the JVM.”
>
> And in this case “Restart NIFI”
>
> Is that “our collective” understanding also, that a Restart of NIFI is 
> mandatory - or optional?
>
> Thanks
>
> John
>


Re: Filename attribute is changing automatically at fetchS3 processor.

2020-11-21 Thread Mike Thomsen
It drops the prefix and sets the filename in keeping with normal NiFi
conventions for the filename attribute.

On Fri, Nov 20, 2020 at 11:06 AM naga satish  wrote:
>
> I have created a nifi flow in which I’m listing files from S3 and fetching 
> those files. After the files are fetched by FetchS3 the name of the file is 
> changing. Here is the screenshot. My NiFi supported by cloudera and hosted on 
> AWS.


Stacktrace from ParquetReader

2020-11-20 Thread Mike Thomsen
java.lang.NullPointerException: Name is null
at java.lang.Enum.valueOf(Enum.java:236)
at 
org.apache.parquet.hadoop.metadata.CompressionCodecName.valueOf(CompressionCodecName.java:26)
at 
org.apache.nifi.parquet.utils.ParquetUtils.createParquetConfig(ParquetUtils.java:172)
at 
org.apache.nifi.parquet.ParquetReader.createRecordReader(ParquetReader.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:254)
at 
org.apache.nifi.controller.service.StandardControllerServiceInvocationHandler.invoke(StandardControllerServiceInvocationHandler.java:105)
at com.sun.proxy.$Proxy190.createRecordReader(Unknown Source)
at 
org.apache.nifi.processors.standard.SplitRecord$1.process(SplitRecord.java:156)

I think it's because I forgot to set the schema write strategy on the
ParquetRecordSetWriter to be "write schema." Is there a workaround for
this?

Thanks,

Mike


Re: ScriptedReader - how to reuse Java libraries from nifi-standard-nar?

2020-11-20 Thread Mike Thomsen
This is a flow that we're starting to use with some of our scripts:

https://mikethomsen.github.io/posts/2020/11/08/testing-executescript-modules-with-the-nifi-test-framework

It's pretty easy to just fire up IntelliJ (community or premium) and
test scripts out that way. It's not necessary for trivial ones, but if
you find yourself writing a lot of business logic, you can get a lot
of coverage this way and add regression testing that doesn't require
you to manually test things by hand from the browser.

On Fri, Nov 20, 2020 at 4:57 AM Piper, Nick  wrote:
>
> Thank you Mike for the thought on it, we'll consider that approach - in 
> particular, it does look useful that the (compiled) code is reloaded without 
> having to restart NiFi.
>
> For many of our uses of scripted code, a theoretical advantage is avoiding 
> having to install binaries or add things to the filesystem. It's unfortunate 
> that any significant script will almost certainly need some extra library, 
> such as those which make up NiFi itself, but the script cannot use that. I 
> wonder how 'stateless' NiFi will be coping with this, as the target runtime 
> operating system filesystems will need to have the script dependencies 
> available.
>
> Regards,
>
>  Nick
>
> -Original Message-
> From: Mike Thomsen 
> Sent: 17 November 2020 1:31 PM
> To: users@nifi.apache.org
> Subject: Re: ScriptedReader - how to reuse Java libraries from 
> nifi-standard-nar?
>
> I would recommend creating a fat jar that has precisely what you need and 
> referencing that. That's a pattern my team's been using for a while to get 
> near custom NAR functionality through ExecuteScript.
> Here's an overview of what it looks like:
>
> https://mikethomsen.github.io/posts/2020/11/06/taking-executescript-to-the-next-level-with-fat-jars/
>
> You should be able to copy and paste the Maven pom.xml and update it to match 
> your needs pretty easily. Make sure to configure any NiFi api jars as 
> "provided" scope.
>
> On Tue, Nov 17, 2020 at 7:23 AM Piper, Nick  wrote:
> >
> > I've implemented a groovy ScriptedReader , and wish to reuse jar libraries 
> > which are already part of NiFi from my script.
> >
> > https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.5.0/org.apache.nifi.record.script.ScriptedReader/index.html
> >
> > The script can be seen at
> > https://gist.github.com/pipern/2d67fbc0a4225cf1b42d1fff832c2c54
> >
> > My script needs some of the 'jar files' which are present in 
> > nifi-standard-nar. Is there a way I can reuse those already inside that nar 
> > file? Otherwise I'll have to unpack the nar, place the jar files in some 
> > other folder, and refer to that folder in 'Module Directory' - on every 
> > machine in the NiFi cluster. I'm looking for maybe some way to have my 
> > goovy script use a different classloader (?) or somehow reuse the existing 
> > jar files inside that nar file. Maybe it would be reliable to predict the 
> > 'work' folder and look for unpacked nar files in there...
> >
> > Many thanks,
> >
> >  Nick


Re: ScriptedReader - how to reuse Java libraries from nifi-standard-nar?

2020-11-17 Thread Mike Thomsen
I would recommend creating a fat jar that has precisely what you need
and referencing that. That's a pattern my team's been using for a
while to get near custom NAR functionality through ExecuteScript.
Here's an overview of what it looks like:

https://mikethomsen.github.io/posts/2020/11/06/taking-executescript-to-the-next-level-with-fat-jars

You should be able to copy and paste the Maven pom.xml and update it
to match your needs pretty easily. Make sure to configure any NiFi api
jars as "provided" scope.
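
Concretely, the dependency block in the fat jar's pom.xml would look something
like this (the version is just an example -- match it to your NiFi release):

<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-api</artifactId>
    <version>1.12.1</version>
    <!-- provided: already on NiFi's classpath, so keep it out of the fat jar -->
    <scope>provided</scope>
</dependency>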

On Tue, Nov 17, 2020 at 7:23 AM Piper, Nick  wrote:
>
> I've implemented a groovy ScriptedReader , and wish to reuse jar libraries 
> which are already part of NiFi from my script.
>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-scripting-nar/1.5.0/org.apache.nifi.record.script.ScriptedReader/index.html
>
> The script can be seen at 
> https://gist.github.com/pipern/2d67fbc0a4225cf1b42d1fff832c2c54
>
> My script needs some of the 'jar files' which are present in 
> nifi-standard-nar. Is there a way I can reuse those already inside that nar 
> file? Otherwise I'll have to unpack the nar, place the jar files in some 
> other folder, and refer to that folder in 'Module Directory' - on every 
> machine in the NiFi cluster. I'm looking for maybe some way to have my goovy 
> script use a different classloader (?) or somehow reuse the existing jar 
> files inside that nar file. Maybe it would be reliable to predict the 'work' 
> folder and look for unpacked nar files in there...
>
> Many thanks,
>
>  Nick


Setting up debug logging on InvokeHttp

2020-11-12 Thread Mike Thomsen
I'm trying to get a good look at the request that is being sent over.
Does anyone know what sort of logback configuration would work best
for showing the headers and the body being sent?

Thanks,

Mike


Re: NIFI Groovy Script - Filter file names and get count

2020-10-30 Thread Mike Thomsen
You need to use the listFiles() overload that takes a FilenameFilter:

https://docs.oracle.com/javase/8/docs/api/java/io/File.html#listFiles-java.io.FilenameFilter-
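
A minimal sketch of that (assuming eahpath really points at a directory and you
want a case-insensitive .txt match):

def flowFile = session.get()
if (!flowFile) return

def eahpath = flowFile.getAttribute('eahpath')
// listFiles(FilenameFilter) only keeps the entries the filter accepts
def count = new File(eahpath).listFiles({ dir, name ->
    name.toLowerCase().endsWith('.txt')
} as FilenameFilter)?.size() ?: 0

flowFile = session.putAttribute(flowFile, 'eahfilecount', count as String)
session.transfer(flowFile, REL_SUCCESS)

The ?. and ?: guards cover the case where the path isn't a directory, so you
get 0 instead of a NullPointerException.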

On Thu, Oct 29, 2020 at 5:12 PM KhajaAsmath Mohammed
 wrote:
>
> Hi,
>
> I have a requirement where I need to get the file count from the path using 
> the groovy script.
>
> I came up with the below but unable to filter and count only txt files . Any 
> suggestions please?
>
> import org.apache.commons.io.IOUtils
> import java.nio.charset.*;
> import java.io.*;
>
> def flowFile = session.get()
> if(!flowFile) return
>
>
> def eahpath = flowFile.getAttribute("eahpath")
> def count = new File(eahpath).listFiles().size();// I need to filter only 
> txt files and get count
> flowFile=session.putAttribute(flowFile,"eahfilecount",count+"");
> def fail = false
> if(fail){
> session.transfer(flowFile, REL_FAILURE)
> fail = false
> } else {
> session.transfer(flowFile, REL_SUCCESS)
> }
>
> Thanks,
> Asmath


Re: Run Nifi in IntelliJ to debug?

2020-10-26 Thread Mike Thomsen
Are you using a binary derived from the source code in your IDE? Like
a 1.12.1 binary and the source code from the release?

On Mon, Oct 26, 2020 at 7:47 PM Russell Bateman  wrote:
>
> Hmmm... It's rare that I debug NiFi code. And it's also rare that I debug my 
> own in that context since the NiFi test runner allows me to fend off most 
> surprises via my JUnit tests.
>
> I think back in 2016, I was debugging a start-up problem involving NiFi 
> start-up and incompatibility with the Java Flight Recorder. As I recall, I 
> downloaded the relevant NiFi code sources matching the version of NiFi I was 
> debugging remotely. I remember ultimately making a slight (and only 
> temporary) change to NiFi start-up that fixed the problem. At that point I 
> must have been building my own copy to have seen it fixed.. It had to do with 
> the order in which NiFi was getting command-line arguments making it so the 
> JFR wasn't running. I'd have to dig back to figure out what I was doing, but 
> it's probably not too relevant to what you need to do.
>
> What do you need to see in this?
>
> Russ
>
> On 10/26/20 5:38 PM, Darren Govoni wrote:
>
> Correct. Primarily the nifi-web-api module and AccessResource class. For 
> starters.
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
> 
> From: Russell Bateman 
> Sent: Monday, October 26, 2020 7:37:13 PM
> To: Darren Govoni ; users@nifi.apache.org 
> 
> Subject: Re: Run Nifi in IntelliJ to debug?
>
> Darren,
>
> This is just Apache NiFi code out of NARs you want to step through or is it 
> yours? You haven't stripped debug information or anything, right?
>
> Russ
>
> On 10/26/20 5:30 PM, Darren Govoni wrote:
>
> Kevin/Russel
>
> Thanks for the info. I did set things up this way.
>
> IntelliJ does connect to the nifi jvm and nifi runs and works but intellij 
> isnt breaking on code it should.
>
> I did set the module where the code/classes are located (in the remote 
> connection dialog) and i see the exception im tracking print on the console 
> output but intellij never breaks.
>
> Is there an extra step needed? Generate sources?
>
> For future it would be nice if there was a maven goal for debug.
>
> Much appreciated!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
> 
> From: Russell Bateman 
> Sent: Monday, October 26, 2020 4:09:50 PM
> To: users@nifi.apache.org ; Darren Govoni 
> 
> Subject: Re: Run Nifi in IntelliJ to debug?
>
> Darren,
>
> I was out this morning and didn't see your plea until I got in just now. 
> Here's a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm more 
> an IntelliJ guy). It also covers using an IP tunnel.
>
> https://www.javahotchocolate.com/notes/nifi.html#20160323
>
> On 10/26/20 9:52 AM, Darren Govoni wrote:
>
> Hi
>Is it possible to run Nifi from inside IntelliJ with debugging such that I 
> can hit the app from my browser and trigger breakpoints?
>
> If anyone has done this can you please share any info?
>
> Thanks in advance!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
>
>
>


Re: Run Nifi in IntelliJ to debug?

2020-10-26 Thread Mike Thomsen
If you want to use the Docker image, add "-e NIFI_JVM_DEBUGGER=1" and
map port 8000 to something on your machine.

On Mon, Oct 26, 2020 at 4:10 PM Russell Bateman  wrote:
>
> Darren,
>
> I was out this morning and didn't see your plea until I got in just now. 
> Here's a step by step I wrote up for both IntelliJ IDEA and Eclipse (I'm more 
> an IntelliJ guy). It also covers using an IP tunnel.
>
> https://www.javahotchocolate.com/notes/nifi.html#20160323
>
> On 10/26/20 9:52 AM, Darren Govoni wrote:
>
> Hi
>Is it possible to run Nifi from inside IntelliJ with debugging such that I 
> can hit the app from my browser and trigger breakpoints?
>
> If anyone has done this can you please share any info?
>
> Thanks in advance!
> Darren
>
> Sent from my Verizon, Samsung Galaxy smartphone
> Get Outlook for Android
>
>


Re: Putdatabase Record - Value too large for column

2020-10-26 Thread Mike Thomsen
One way you might be able to get there would be to add a SplitRecord
on the failure relationship of the processor and have it loop its
output back to the processor so you can narrow down which record was
failing.

On Sun, Oct 25, 2020 at 1:58 PM KhajaAsmath Mohammed
 wrote:
>
> Hi,
>
> I am using putddabase record with json tree reader to insert data into 
> database. This works great but is there a possibility to get column name in 
> the error?  I need to open the file and see the text to find out the column.
>
> putdatabaserecord.error
> SAP DBTech JDBC: Value too large for column:
>
> Displaying column name will be really helpful. Any suggestions?
>
> Thanks,
> Asmath
>


Re: Build Problem 1.11.4 on MacOS

2020-10-20 Thread Mike Thomsen
I had build problems on macOS for a long time, and when I switched to
u265 everything seemed to build again.

On Tue, Oct 20, 2020 at 12:47 PM Joe Witt  wrote:
>
> Darren,
>
> I believe there were gremlins in that JDK release.. Can you please try 
> something like 265?
>
> On Tue, Oct 20, 2020 at 8:52 AM Darren Govoni  wrote:
>>
>> Hi,
>>   Seem to have this recurring problem trying to build on MacOS with 
>> nifi-utils. Anyone have a workaround or fix for this?
>>
>> Thanks in advance!
>>
>> [ERROR] Failed to execute goal 
>> org.apache.maven.plugins:maven-compiler-plugin:3.8.1:testCompile 
>> (groovy-tests) on project nifi-utils: Compilation failure
>> [ERROR] Failure executing groovy-eclipse compiler:
>> [ERROR] Annotation processing got disabled, since it requires a 1.6 
>> compliant JVM
>> [ERROR] Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
>> initialize class org.codehaus.groovy.vmplugin.v7.Java7
>> [ERROR] at 
>> org.codehaus.groovy.vmplugin.VMPluginFactory.(VMPluginFactory.java:43)
>>
>> AFAIK my jvm is compliant
>>
>> dgovoni@C02RN8AHG8WP nifi % java -version
>> openjdk version "1.8.0_262"
>> OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_262-b10)
>> OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.262-b10, mixed mode)
>> dgovoni@C02RN8AHG8WP nifi %
>>
>>


Re: Compare two NiFi servers

2020-10-14 Thread Mike Thomsen
Copy flow.xml.gz from both and either write a script or find a tool
that can look for the processor references. That file is essentially
the master state of what you see on the canvas.
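
As a rough starting point, a Groovy sketch (assuming the standard flow.xml
layout) that prints every processor's name and class so you can diff the output
from the two servers:

import java.util.zip.GZIPInputStream

def flow = new XmlSlurper().parse(new GZIPInputStream(new FileInputStream('flow.xml.gz')))

// walk every element depth-first and keep the <processor> entries,
// no matter how deeply the process groups are nested
flow.'**'.findAll { it.name() == 'processor' }.each { p ->
    println "${p.'name'.text()}\t${p.'class'.text()}"
}

Run it against each server's copy and diff the two listings.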

On Tue, Oct 13, 2020 at 5:44 PM Alberto Dominguez
 wrote:
>
> Hello,
>
> I have two environments and one NiFi server in each one. How can I compare
> which flows and processors that I have in each one?
>
> I have until 5 levels of grouping.
>
> NiFi Home --> First --> Second...
>
> Thank you!


Re: Query Record processor

2020-09-23 Thread Mike Thomsen
https://calcite.apache.org/docs/reference.html

On Wed, Sep 23, 2020 at 3:24 PM Mike Thomsen  wrote:
>
> Asmath,
>
> I would check the Apache Calcite docs to see what syntax is supported.
> I ran into a minor head-scratcher there as well a few months ago when
> some date function I was expecting turned out to not be implemented
> yet.
>
> Mike
>
> On Wed, Sep 23, 2020 at 3:04 PM KhajaAsmath Mohammed
>  wrote:
> >
> > Hi,
> >
> > I am looking for some information on how to check datatypes of the data and 
> > load transform them accordingly. I am okay to use any other processor to.
> >
> > My req:
> >
> > Check if column is Integer, if integer then load to _INT column else null 
> > value
> > Check if column length is > 256, if more than 256 load to _Text column else 
> > load to varchar column.
> >
> > I am assuming we can use case statements and length in query record but not 
> > able to get the syntax. Any help is appreciated
> >
> > Thanks,
> > Asmath


Re: Query Record processor

2020-09-23 Thread Mike Thomsen
Asmath,

I would check the Apache Calcite docs to see what syntax is supported.
I ran into a minor head-scratcher there as well a few months ago when
some date function I was expecting turned out to not be implemented
yet.
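
For the length-based part, a sketch of what usually works in QueryRecord (the
column names here are made up -- you'd add one SELECT like this per query
property):

SELECT
  CASE WHEN CHAR_LENGTH(col1) > 256 THEN col1 ELSE NULL END AS col1_text,
  CASE WHEN CHAR_LENGTH(col1) <= 256 THEN col1 ELSE NULL END AS col1_varchar
FROM FLOWFILE

For the integer check you may be able to combine something like
col2 SIMILAR TO '-?[0-9]+' with a CAST, but verify against the Calcite
reference first since not everything listed there is wired up in every NiFi
release.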

Mike

On Wed, Sep 23, 2020 at 3:04 PM KhajaAsmath Mohammed
 wrote:
>
> Hi,
>
> I am looking for some information on how to check datatypes of the data and 
> load transform them accordingly. I am okay to use any other processor to.
>
> My req:
>
> Check if column is Integer, if integer then load to _INT column else null 
> value
> Check if column length is > 256, if more than 256 load to _Text column else 
> load to varchar column.
>
> I am assuming we can use case statements and length in query record but not 
> able to get the syntax. Any help is appreciated
>
> Thanks,
> Asmath


Re: Best way to tune NiFi for huge amounts of small flowfiles

2020-09-11 Thread Mike Thomsen
Craig and Jeremy,

Thanks. The point about using different disks for different
repositories is definitely something to add to the list.

On Fri, Sep 11, 2020 at 3:11 PM Jeremy Dyer  wrote:
>
> Hey Mike,
>
> When you say "flows that may drop in several million ... flowfiles" I read 
> that as a single node that might be inundated with tons of source data (local 
> files, ftp, kafka messages, etc). Just my 2 cents but if you don't have 
> strict SLAs (and this kind of sounds like a 1 time thing) I wouldn't even 
> worry about it and just let the system back pressure and process in time as 
> designed. That process will be "safe" although maybe not fast. If you need 
> speed throw lots of NVMe mounts at it. We process well into the tens 
> (sometimes hundreds) of millions of flowfiles a day on a 5 node cluster with 
> no issues. However our hardware is quite over the top.
>
> Thanks,
> Jeremy Dyer
>
> On Fri, Sep 11, 2020 at 12:51 PM Mike Thomsen  wrote:
>>
>> What are the general recommended practices around tuning NiFi to
>> safely handle flows that may drop in several million very small
>> flowfiles (2k-10kb each) onto a single node? It's possible that some
>> of the data dumps we're processing (and we can't control their size)
>> will drop about 3.5-5M flowfiles the moment we expand them in the
>> flow.
>>
>> (Let me emphasize again, it was not our idea to dump the data this way)
>>
>> Any pointers would be appreciated.
>>
>> Thanks,
>>
>> Mike


Best way to tune NiFi for huge amounts of small flowfiles

2020-09-11 Thread Mike Thomsen
What are the general recommended practices around tuning NiFi to
safely handle flows that may drop in several million very small
flowfiles (2k-10kb each) onto a single node? It's possible that some
of the data dumps we're processing (and we can't control their size)
will drop about 3.5-5M flowfiles the moment we expand them in the
flow.

(Let me emphasize again, it was not our idea to dump the data this way)

Any pointers would be appreciated.

Thanks,

Mike


Re: Modifying putElasticSearchHTTP processor to use AWS IAM role based awscredentialprovide service for access

2020-09-11 Thread Mike Thomsen
You would probably be better off implementing your own controller
service with the same interface.

On Wed, Sep 9, 2020 at 10:13 PM sanjeet rath  wrote:
>
> Thank you MIke for the quick reply.
> I was really struggling with this functionality.
> i have gone through the code ,what i understood is i should use the 
> "nifi-elastic-search-restapi-processor" project.
>
> In it the JsonQueryelasticSearch processor, it uses the "Client Service" 
> Controller service. and i need to modify this controler. service to use AWS 
> shared code which i shared with you in the trailed mail chain.
>
> Is my understanding is correct ?
>
> Regards,
> Sanjeet
>
>
>
> On Thu, Sep 10, 2020 at 3:18 AM Mike Thomsen  wrote:
>>
>> Sanjeet,
>>
>> As provided, this won't integrate well with the existing NiFi
>> processors. You would need to implement it as a controller service
>> object and update the processors to use it. Also, if you want to use
>> processors based on the official Elasticsearch client API, the ones
>> under the "REST API bundle" are the best fit because they already use
>> controller services that use the official Elastic clients.
>>
>> Thanks,
>>
>> Mike
>>
>> On Wed, Sep 9, 2020 at 12:14 PM sanjeet rath  wrote:
>> >
>> > Hi ,
>> >
>> > We are using AWS managed ElasticSearch and our nifi is hosted in EC2.
>> > I have a use case of building a custom processor on top of 
>> > putElasticSearchHTTP, where it will use aws IAM based role 
>> > awscredentialprovider service to connect AWS ElasticSearch.
>> > This will be similar to PUTSQS where we are using IAM role based 
>> > awscredentialprovider service to connect SQS and its working fine.
>> >
>> > But there is no awscredentailprovider controller service is available in 
>> > putElasticSearchHTTP.
>> >
>> > So my plan is adding a awscredentailprovider controller service to 
>> > putElasticSearchHTTP , where i will use bellow code  to connect to 
>> > elasticsearch.
>> >
>> > Is my approach correct ? Could you provide any better thought on this ?
>> >
>> > public class AmazonElasticsearchServiceSample { private static String 
>> > serviceName = "es"; private static String region = "us-west-1"; private 
>> > static String aesEndpoint = "https://domain.us-west-1.es.amazonaws.com;; 
>> > private static String payload = "{ \"type\": \"s3\", \"settings\": { 
>> > \"bucket\": \"your-bucket\", \"region\": \"us-west-1\", \"role_arn\": 
>> > \"arn:aws:iam::123456789012:role/TheServiceRole\" } }"; private static 
>> > String snapshotPath = "/_snapshot/my-snapshot-repo"; private static String 
>> > sampleDocument = "{" + "\"title\":\"Walk the Line\"," + 
>> > "\"director\":\"James Mangold\"," + "\"year\":\"2005\"}"; private static 
>> > String indexingPath = "/my-index/_doc"; static final 
>> > AWSCredentialsProvider credentialsProvider = new 
>> > DefaultAWSCredentialsProviderChain(); public static void main(String[] 
>> > args) throws IOException { RestClient esClient = esClient(serviceName, 
>> > region); // Register a snapshot repository HttpEntity entity = new 
>> > NStringEntity(payload, ContentType.APPLICATION_JSON); Request request = 
>> > new Request("PUT", snapshotPath); request.setEntity(entity); // 
>> > request.addParameter(name, value); // optional parameters Response 
>> > response = esClient.performRequest(request); 
>> > System.out.println(response.toString()); // Index a document entity = new 
>> > NStringEntity(sampleDocument, ContentType.APPLICATION_JSON); String id = 
>> > "1"; request = new Request("PUT", indexingPath + "/" + id); 
>> > request.setEntity(entity); // Using a String instead of an HttpEntity sets 
>> > Content-Type to application/json automatically. // 
>> > request.setJsonEntity(sampleDocument); response = 
>> > esClient.performRequest(request); System.out.println(response.toString()); 
>> > }
>> > public static RestClient esClient(String serviceName, String region) { 
>> > AWS4Signer signer = new AWS4Signer(); signer.setServiceName(serviceName); 
>> > signer.setRegionName(region); HttpRequestInterceptor interceptor = new 
>> > AWSRequestSigningApacheInterceptor(serviceName, signer, 
>> > credentialsProvider); return 
>> > RestClient.builder(HttpHost.create(aesEndpoint)).setHttpClientConfigCallback(hacb
>> >  -> hacb.addInterceptorLast(interceptor)).build(); }
>> > https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-request-signing.html
>> >
>> >
>> >
>> > Regards,
>> > Sanjeet
>> >
>> > --
>> > Sanjeet Kumar Rath,
>> > mob- +91 8777577470
>> >
>
>
>
> --
> Sanjeet Kumar Rath,
> mob- +91 8777577470
>


Re: Modifying putElasticSearchHTTP processor to use AWS IAM role based awscredentialprovide service for access

2020-09-09 Thread Mike Thomsen
Sanjeet,

As provided, this won't integrate well with the existing NiFi
processors. You would need to implement it as a controller service
object and update the processors to use it. Also, if you want to use
processors based on the official Elasticsearch client API, the ones
under the "REST API bundle" are the best fit because they already use
controller services that use the official Elastic clients.

Thanks,

Mike

On Wed, Sep 9, 2020 at 12:14 PM sanjeet rath  wrote:
>
> Hi ,
>
> We are using AWS managed ElasticSearch and our nifi is hosted in EC2.
> I have a use case of building a custom processor on top of 
> putElasticSearchHTTP, where it will use aws IAM based role 
> awscredentialprovider service to connect AWS ElasticSearch.
> This will be similar to PUTSQS where we are using IAM role based 
> awscredentialprovider service to connect SQS and its working fine.
>
> But there is no awscredentailprovider controller service is available in 
> putElasticSearchHTTP.
>
> So my plan is adding a awscredentailprovider controller service to 
> putElasticSearchHTTP , where i will use bellow code  to connect to 
> elasticsearch.
>
> Is my approach correct ? Could you provide any better thought on this ?
>
> public class AmazonElasticsearchServiceSample { private static String 
> serviceName = "es"; private static String region = "us-west-1"; private 
> static String aesEndpoint = "https://domain.us-west-1.es.amazonaws.com;; 
> private static String payload = "{ \"type\": \"s3\", \"settings\": { 
> \"bucket\": \"your-bucket\", \"region\": \"us-west-1\", \"role_arn\": 
> \"arn:aws:iam::123456789012:role/TheServiceRole\" } }"; private static String 
> snapshotPath = "/_snapshot/my-snapshot-repo"; private static String 
> sampleDocument = "{" + "\"title\":\"Walk the Line\"," + "\"director\":\"James 
> Mangold\"," + "\"year\":\"2005\"}"; private static String indexingPath = 
> "/my-index/_doc"; static final AWSCredentialsProvider credentialsProvider = 
> new DefaultAWSCredentialsProviderChain(); public static void main(String[] 
> args) throws IOException { RestClient esClient = esClient(serviceName, 
> region); // Register a snapshot repository HttpEntity entity = new 
> NStringEntity(payload, ContentType.APPLICATION_JSON); Request request = new 
> Request("PUT", snapshotPath); request.setEntity(entity); // 
> request.addParameter(name, value); // optional parameters Response response = 
> esClient.performRequest(request); System.out.println(response.toString()); // 
> Index a document entity = new NStringEntity(sampleDocument, 
> ContentType.APPLICATION_JSON); String id = "1"; request = new Request("PUT", 
> indexingPath + "/" + id); request.setEntity(entity); // Using a String 
> instead of an HttpEntity sets Content-Type to application/json automatically. 
> // request.setJsonEntity(sampleDocument); response = 
> esClient.performRequest(request); System.out.println(response.toString()); }
> public static RestClient esClient(String serviceName, String region) { 
> AWS4Signer signer = new AWS4Signer(); signer.setServiceName(serviceName); 
> signer.setRegionName(region); HttpRequestInterceptor interceptor = new 
> AWSRequestSigningApacheInterceptor(serviceName, signer, credentialsProvider); 
> return 
> RestClient.builder(HttpHost.create(aesEndpoint)).setHttpClientConfigCallback(hacb
>  -> hacb.addInterceptorLast(interceptor)).build(); }
> https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-request-signing.html
>
>
>
> Regards,
> Sanjeet
>
> --
> Sanjeet Kumar Rath,
> mob- +91 8777577470
>


Re: scraping aspx web

2020-09-02 Thread Mike Thomsen
You're better off with a tool like Scrapy for something like this:
https://scrapy.org/

On Wed, Sep 2, 2020 at 2:07 PM tkg_cangkul  wrote:

> Dear All,
>
> I wanna try to scrapping aspx web with nifi. is there any suggestion to
> convert aspx grid into html table or csv file ?
>
> Below is the sample aspx grid view format that i've got
>
>
>
> Is this possible to do with nifi?
> Need advice.
>
>
> Best Regards,
>


Re: How to read an element inside another element in json with UpdateRecord

2020-08-26 Thread Mike Thomsen
That's JsonPath, not a record path, but it would be almost the same:
/data/[1][0] to get the date. Adjust the array indexes accordingly to get
other values.

On Tue, Aug 25, 2020 at 5:52 PM Eric Ladner  wrote:

> Try $.data[6][1] to get the "15.m.dl.0.search" entry.
>
> On Tue, Aug 25, 2020 at 3:17 PM Wesley C. Dias de Oliveira <
> wcdolive...@gmail.com> wrote:
>
>> Hi, Nifi Community.
>>
>> I'm trying to read an element inside another element in json with
>> UpdateRecord in the following json:
>>
>> "data": [
>> ["Date", "Campaign name", "Cost"],
>> ["2020-08-25", "01.M.VL.0.GSP", 75.14576],
>> ["2020-08-25", "11.b.da.0.search", 344.47],
>> ["2020-08-25", "12.m.dl.0.search", 98.04],
>> ["2020-08-25", "13.m.dl.0.search", 276.98],
>> ["2020-08-25", "14.m.dl.0.search", 23.7],
>> ["2020-08-25", "15.m.dl.0.search", 3.87],
>> ["2020-08-25", "16.b.da.0.search", 4.2],
>> ["2020-08-25", "19.m.dl.0.display", 71.452542],
>> ["2020-08-25", "55.m.vl.1.youtube", 322.875653],
>> ["2020-08-25", "57.m.dl.0.youtube", 124.061768],
>> ["2020-08-25", "58.m.vl.1.youtube", 0.387847],
>> ["2020-08-25", "59.m.vl.1.youtube", 72.637692],
>> ["2020-08-25", "62.b.vl.1.youtube", 1.397887]
>> ]
>>
>> For example, I need to get the value '59.m.vl.1.youtube' or the date
>> value '2020-08-25'.
>>
>> Here's my processor settings:
>> [image: image.png]
>>
>> Can someone suggest something?
>>
>> Thank you.
>> --
>> Grato,
>> Wesley C. Dias de Oliveira.
>>
>> Linux User nº 576838.
>>
>
>
> --
> Eric Ladner
>


Re: Calling DLL functions on a Windows PC

2020-08-25 Thread Mike Thomsen
ExecuteStreamCommand might be able to work with rundll or you could
wrap the DLL with some .NET code and make an exe which
ExecuteStreamCommand would run.

On Tue, Aug 25, 2020 at 9:49 AM Jeremy Pemberton-Pigott
 wrote:
>
> Hi,
>
> Is there a processor that allows me to make functions in a local DLL?  Jython 
> or maybe Python script perhaps or I would have to write a custom jar with JNI 
> calls?  NodeJS Addons has the ability to do it and it works, now I want to 
> make the same calls from a NiFi flow without NodeJS.
>
> Jeremy


Re: need parse first few lines to customize a header for csv

2020-08-01 Thread Mike Thomsen
The key here would be "is it the same?" If it's not, the most
practical solution would be to pursue a business decision to fix the
upstream process.

On Sat, Aug 1, 2020 at 2:47 AM Jens M. Kofoed  wrote:
>
> Hi
>
> What about the ReplaceText. If you know it will always be 4 header lines, you 
> could use a regex like this: ^topic.*?$ ^.*?$ ^.*?$ ^.*?$
> or using 4 ReplaceText after each other, only replacing First-Line.
>
> regards
> Jens M. Kofoed
>
> Den fre. 31. jul. 2020 kl. 21.09 skrev gemlan123 :
>>
>> I have some csv files with first few lines like this:
>>
>> topic: abc ,,
>> some comments: xxx,,
>> col1,col2,col3,flag
>> ,
>> ,,,col4,col5
>> data1,data2,data3,data4,data5
>> ..
>>
>> I want to parse the first 4 lines and have a flowfile like:
>> col1,col2,col3,col4,col5
>> data1,data2,data3,data4,data5
>>
>> try to use RouteText processor to achieve it and if possible try to avoid
>> scan all the lines within the file.
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-nifi-users-list.2361937.n4.nabble.com/


Re: ExecuteScript with apache Ignite

2020-07-24 Thread Mike Thomsen
Try building a fat jar that has your Ignite dependencies in it and
referencing it in the module configuration of ExecuteScript. You might
be seeing a collision between the Grapes classloader and the one NiFi
is using here.

On Fri, Jul 24, 2020 at 12:17 PM Carlos Manuel Fernandes (DSI)
 wrote:
>
> Hello,
>
>
>
> I am trying to connect Nifi with Apache Ignite  to put  some data  on Ignite 
> cache using ExecuteScript because  putIgniteCache  and GetIgniteCache 
> processors are bounded to an older Ignite version.
>
>
>
> I made Test1 (below) using standalone groovy without Nifi and work Well.   
> Test2(below) using Nifi groovy ExecuteScript  in Ignition.start  always  run 
> on error:  java.lang.ClassNotFoundException: 
> org.apache.ignite.configuration.IgniteConfiguration. I am certain the two 
> Grabs work well  because  I haven’t errors on Import statements and  the jars 
> are in the grapes Folder.
>
>
>
> Any idea?
>
>
>
> Thanks
>
> Carlos
>
>
>
> Test1 – StandAlone groovy program (work well)
>
>
>
>
>
> @Grab ('org.apache.ignite:ignite-core:2.8.1')
>
> @Grab ('org.apache.ignite:ignite-spring:2.8.1')
>
>
>
> import org.apache.ignite.Ignite;
>
> import org.apache.ignite.IgniteCache;
>
> import org.apache.ignite.Ignition
>
>
>
> Ignition.setClientMode(true);
>
>
>
> // Here, we provide the cache configuration file
>
> Ignite objIgnite = Ignition.start("c:\\tmp\\ignite\\first-config.xml");
>
>
>
> // create cache if not already existing
>
> IgniteCache objIgniteCache = 
> objIgnite.getOrCreateCache("myFirstIgniteCache");
>
>
>
> // Populating the cache with few values
>
> objIgniteCache.put(1, "salman");
>
> objIgniteCache.put(2, "Abhishek");
>
> objIgniteCache.put(3, "Siddharth");
>
> objIgniteCache.put(4, "Dev");
>
>
>
> // Get these items from cache
>
> System.out.println(objIgniteCache.get(1));
>
> System.out.println(objIgniteCache.get(2));
>
> System.out.println(objIgniteCache.get(3));
>
> System.out.println(objIgniteCache.get(4));
>
> Ignition.stop(true);
>
>
>
>
>
>
>
> Test2 – ExecuteScript groovy code (don’t work)
>
>
>
> @Grab ('org.apache.ignite:ignite-core:2.8.1')
>
> @Grab ('org.apache.ignite:ignite-spring:2.8.1')
>
>
>
> import org.apache.ignite.Ignite;
>
> import org.apache.ignite.IgniteCache;
>
> import org.apache.ignite.Ignition
>
>
>
> def flowFile = session.get()
>
> if (!flowFile) return
>
>
>
> try {
>
>Ignition.setClientMode(true);
>
>
>
>// Here, we provide the cache configuration file
>
>log.info("Before Ignite")
>
>Ignite objIgnite = 
> Ignition.start("/apps/nifi-scripts/first-config.xml");
>
>log.info("After Ignite")
>
>
>
>// create cache if not already existing
>
>IgniteCache objIgniteCache = 
> objIgnite.getOrCreateCache("myFirstIgniteCache");
>
>
>
>// Populating the cache with few values
>
>log.info("Put on cache")
>
>objIgniteCache.put(1, "salman");
>
>objIgniteCache.put(2, "Abhishek");
>
>objIgniteCache.put(3, "Siddharth");
>
>objIgniteCache.put(4, "carlos");
>
>
>
>// Get these items from cache
>
>log.info("get from cache")
>
>System.out.println(objIgniteCache.get(1));
>
>System.out.println(objIgniteCache.get(2));
>
>System.out.println(objIgniteCache.get(3));
>
>System.out.println(objIgniteCache.get(4));
>
>
>
>log.info("cachequery for 4 is :${objIgniteCache.get(4)}")
>
>Ignition.stop(true);
>
>session.transfer(flowFile, REL_SUCCESS)
>
> }
>
> catch(Exception e) {
>
>log.error('Error:', e)
>
>session.transfer(flowFile, REL_FAILURE)
>
> }
>
> finally {
>
>
>
>log.info("end")
>
> }
>
>
>
>
>
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.ignite.configuration.IgniteConfiguration
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at org.springframework.util.ClassUtils.forName(ClassUtils.java:251)
>
> at 
> org.springframework.beans.factory.support.AbstractBeanDefinition.resolveBeanClass(AbstractBeanDefinition.java:408)
>
> at 
> org.springframework.beans.factory.support.AbstractBeanFactory.doResolveBeanClass(AbstractBeanFactory.java:1444)
>
> at 
> org.springframework.beans.factory.support.AbstractBeanFactory.resolveBeanClass(AbstractBeanFactory.java:1389)
>
> ... 37 common frames omitted


Iteratively concatenate record fields?

2020-07-22 Thread Mike Thomsen
Here's our use case:

Schema is

- addresses:list
  - street:string
  - city:string
  ...
- clean_addresses:list

Something like this using pseudo  recordpath:

foreach(/addresses[*], concat(./street, ' ', ./city, ...)) => /clean_addresses

Can't seem to get that iterative element in there. Is there something
I'm missing or would we need a recordpath function like that pseudo
foreach?

Thanks,

Mike


Re: Enrichment of record data with a REST API

2020-06-29 Thread Mike Thomsen
Matt,

Yeah, I was thinking about that, but there are a lot of variables that have
come up since I wrote that service. One of the big ones is how to take
partial responses and merge them: there's no transformer API for that
service, nothing like a Groovy script, JOLT, etc.

What do you think? I think something JOLT-based similar to
JoltTransformRecord could be a starting point.

On Mon, Jun 29, 2020 at 5:14 PM Matt Burgess  wrote:

> Mike,
>
> I think you can use LookupRecord with a RestLookupService to do this.
> If it's missing features or it otherwise doesn't work for your use
> case, please let us know and/or write up whatever Jiras you feel are
> appropriate.
>
> Regards,
> Matt
>
> On Mon, Jun 29, 2020 at 4:56 PM Mike Thomsen 
> wrote:
> >
> > Does anyone know a good pattern using the Record API to enrich a data
> set record by record with a REST API?
> >
> > Thanks,
> >
> > Mike
>


Enrichment of record data with a REST API

2020-06-29 Thread Mike Thomsen
Does anyone know a good pattern using the Record API to enrich a data set
record by record with a REST API?

Thanks,

Mike


Re: NiFi-light for analysts

2020-06-29 Thread Mike Thomsen
As far as I can tell, Kylo is dead based on their public github activity.

Mark,

Would it make sense for us to start modularizing nifi-assembly with more
profiles? That way people like Boris could run something like this:

mvn install -Pinclude-grpc,include-graph,!include-kafka,!include-mongodb

On Mon, Jun 29, 2020 at 11:20 AM Boris Tyukin  wrote:

> Hi Mark, thanks for the great comments and for working on these
> improvements. these are great enhancements that we
> can certainly benefit from - I am thinking of two projects at least we
> support today.
>
> As far as making it more user-friendly, at some point I looked at Kylo.io
> and it was quite an interesting project - not sure if it is alive still -
> but I liked how they created their own UI/tooling around NiFi.
>
> I am going to toy with this idea to have a "dumb down" version of NiFi.
>
> On Sun, Jun 28, 2020 at 3:36 PM Mark Payne  wrote:
>
>> Hey Boris,
>>
>> There’s a good bit to unpack here but I’ll try to answer each question.
>>
>> 1) I would say that the target audience for NiFi really is a person with
>> a pretty technical role. Not developers, necessarily, though. We do see a
>> lot of developers using it, as well as data scientists, data engineers, sys
>> admins, etc. So while there may be quite a few tasks that a non-technical
>> person can achieve, it may be hard to expose the platform to someone
>> without a technical background.
>>
>> That said, I do believe that you’re right about the notion of flow
>> dependencies. I’ve done some work recently to help improve this. For
>> example, NIFI-7476 [1] makes it possible to configure a Process Group in
>> such a way that only a single FlowFile at a time is allowed into the group.
>> And the data is optionally held within the group until that FlowFile has
>> completed processing, even if it’s split up into many parts. Additionally,
>> NIFI-7509 [2] updates the List* processors so that they can use an optional
>> Record Writer. This makes it possible to get a full listing of a directory
>> from ListFile as a single FlowFile. Or a listing of all items in an S3
>> bucket or an Azure Blob Store, etc. So when that is combined with
>> NIFI-7476, it makes it very easy to process an entire directory of files or
>> an entire bucket, etc. and wait until all processing is complete before
>> data is transferred on to the next task. (Additionally, NIFI-7552 updates
>> this to add attributes indicating FlowFile counts for each Output Port so
>> it’s easy to determine if there were any “processing failures” etc.).
>>
>> So with all of the above said, I don’t think that it necessarily solves
>> in a simple and generic sense the requirement to complete Task A, then Task
>> B, and then Task C. But it does put us far closer. This may be achievable
>> still with some nesting of Process Groups, etc. but it won’t be completely
>> as straight-forward as I’d like and would perhaps add significantly latency
>> if it’s allowing only a single FlowFile at a time though the Process Group.
>> Perhaps that can be addressed in the future by having the ability to bulk
>> transfer all FlowFiles from Queue A to Queue B, and then allowing a "Batch
>> Input" on a Process Group instead of just “Streaming" vs. "Single FlowFile
>> at a Time.” I do think there will be some future improvements along these
>> lines, though.
>>
>> 2) This should be fairly straight-forward. It would basically be just
>> creating an assembly like the nifi-assembly module but one that doesn’t
>> include all of the nar’s.
>>
>> 3) This probably boils down to some trade-offs and what makes most sense
>> for your organization. A single, large NiFi deployment makes it much easier
>> for the sys admins, generally. The NiFi policies should provide the needed
>> multi-tenancy in terms of authorization. But it doesn’t really offer much
>> in terms of resource isolation. So, if resource isolation is important to
>> you, then using separate NiFi deployments is likely desirable.
>>
>> Hope this helps!
>> -Mark
>>
>>
>> [1] https://issues.apache.org/jira/browse/NIFI-7476
>> [2] https://issues.apache.org/jira/browse/NIFI-7509
>> [3] https://issues.apache.org/jira/browse/NIFI-7552
>>
>>
>>
>> On Jun 28, 2020, at 1:04 PM, Boris Tyukin  wrote:
>>
>> Hi guys,
>>
>> I am thinking to increase the footprint of NiFi in my org to extend it to
>> less technical roles. I have a few questions:
>>
>> 1) is there any plans to support easy dependencies at some point? We are
>> aware of all the current options (wait-notify, kafka,
>> mergerecord/mergecontent etc.) and all of them are still hard and not
>> reliable. For non-technical roles, we really need very stupid simple way to
>> define classical dependencies like run task C only after task A and B are
>> finished. I realize it is a challenge because of the whole concept of NiFi
>> with flowfiles (which we do love being on a technical side of the house),
>> but I really do not want to get another ETL/scheduling tool.
>>
>> 2) is 

Re: Custom service in NAR generation failure

2020-06-19 Thread Mike Thomsen
Without seeing your POM(s), it could be several things. Try posting your
POMs here or as a GitHub gist.

On Fri, Jun 19, 2020 at 3:36 AM Etienne Jouvin 
wrote:

> Hello all.
>
> Do not know where to post the message, guide me if I should send to
> another mailing list.
> A simple summary in first step.
> I created a simple project to build a new service.
> I extend the nifi-nar-bundles artifact with version 1.11.4.
> My project version is currently 0.0.1-SNAPSHOT.
>
> During NAR generation, it failed for the documentation with message :
> org.apache.maven.plugin.MojoExecutionException: Failed to create Extension
> Documentation
> Caused by: org.apache.maven.plugin.MojoExecutionException: Could not
> resolve local dependency org.apache.nifi:nifi-api:jar:0.0.1-SNAPSHOT
>
> I am currently looking in source code of nifi-maven project, specially
> class ExtensionClassLoaderFactory.
>
> What I do not understand is why it searches for version 0.0.1-SNAPSHOT on
> nifi-api, and not the version 1.11.4
>
> Let me know if I should discuss about this in another thread.
>
> Regards
>
> Etienne
>


Re: Understanding what version of MongoDB driver I am currently running.

2020-06-02 Thread Mike Thomsen
All you should need to do is download the 1.11.4 source code, update the
Mongo driver version and rebuild the nifi-mongodb-bundle project to get the
updated NAR files.

On Tue, Jun 2, 2020 at 10:40 AM Enrique Olaizola <
enrique.olaiz...@macrohealth.com> wrote:

> Hello,
>
> My organization is running two versions of NiFi at the moment.
>
> 1.11.1
> 1.11.4
>
> According to the following links:
>
>
> https://mvnrepository.com/artifact/org.apache.nifi/nifi-mongodb-client-service-api/1.11.1
>
> https://mvnrepository.com/artifact/org.apache.nifi/nifi-mongodb-client-service-api/1.11.4
>
> These versions of NiFi come with version 3.2.2 of the MongoDB Java driver.
>
> According to this:
>
> https://docs.mongodb.com/drivers/driver-compatibility-reference
>
> This version of the driver only supports features for MongoDB 3.2
>
> However, in the mvn links provided above there is a column named Updates
> under the Compile dependencies which lists version 3.12.4 of the MongoDB
> driver. If I wanted to update my current instances to use the newer driver
> would it simply be a matter of updating the listed dependencies on these
> pages or is there a more involved process?
>
> Regards,
>
> Enrique
>


Best way to handle XMLTYPE in SQL processors

2020-05-28 Thread Mike Thomsen
We have a really weird query we have to run that generates an XML document
as the result. What is the best way/is it possible to run that query and
convert the output to a string so that we're not fighting with the Avro API?

Thanks,

Mike


Re: Accessing flow attributes from ExecuteStreamCommand

2020-05-28 Thread Mike Thomsen
There's no way at the moment to interact with the NiFi API from that
processor. The closest workaround would be to pass in flowfile attributes
as parameters using the parameter configuration field and expression
language.
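
For example (assuming an attribute named record.id and a python3 script), the
Command Arguments property can be set to something like:

/path/to/my_script.py;${record.id};${filename}

The script then reads those as ordinary command-line arguments, and whatever it
writes to stdout becomes the content of the outgoing flowfile.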

On Thu, May 28, 2020 at 10:28 AM Jean-Sebastien Vachon <
jsvac...@brizodata.com> wrote:

> Hi all,
>
>
>
> I am using the ExecuteStreamCommand processor to run a python script to
> crunch different data and I was curious
>
> to know if such a processor could both read and/or write from/to the flow
> attributes.
>
>
>
> Can someone point me to the documentation if this is possible? I could not
> find it by myself.
>
>
>
> Thanks
>
>
>
> Sent from Mail  for
> Windows 10
>
>
>


Re: How to deal Schema Evolution with Dataset API

2020-05-11 Thread Mike Thomsen
This should be posted on the Spark user list, not the NiFi one.

On Sat, May 9, 2020 at 3:07 AM Jorge Machado  wrote:

> Hello everyone,
>
> One question to the community.
>
> Imagine I have this
>
> Case class Person(age: int)
>
> spark.read.parquet(“inputPath”).as[Person]
>
>
> After a few weeks of coding I change the class to:
> Case class Person(age: int, name: Option[String] = None)
>
>
> Then when I run the new code on the same input it fails, saying that it
> cannot find 'name' in the schema from the parquet file.
>
> Spark version 2.3.3
>
> What is the best way to guard against or fix this? Regenerating all the data
> seems not to be an option for us.
>
> Thx
>
>
>
>


Re: Is provenance data preserved when processors are deleted?

2020-05-04 Thread Mike Thomsen
It copies all of the provenance data, and no, unfortunately there's no way
yet to back the provenance repository with one of those NoSQL databases.

On Mon, May 4, 2020 at 6:40 PM Eric Secules  wrote:

> What information is transmitted by SiteToSiteProvenanceReporting? Is it
> the content, the attributes, and the path the flowfile takes through the
> system? Is there any way to connect the provenance view from NiFi to the
> nosql database instead of the internal provenance storage?
>
> On Mon, May 4, 2020 at 3:07 PM Mike Thomsen 
> wrote:
>
>> One way to do it would be to set up a SiteToSiteProvenanceReporting task
>> and have it send the data to another NiFi instance. That instance can post
>> all of the provenance data into a NoSQL database like Mongo or
>> Elasticsearch very quickly.
>>
>> On Mon, May 4, 2020 at 5:47 PM Eric Secules  wrote:
>>
>>> Hello everyone,
>>>
>>> If I am upgrading a process group to the latest version, do you know
>>> whether provenance is preserved for processors that may get deleted in the
>>> upgrade?
>>> I have noticed that if I delete my process group and redownload it from
>>> the registry, I am no longer able to see the provenance data from flowfiles
>>> that went through the first process group.
>>>
>>> What is the best way to view and archive provenance data for older
>>> versions of flows? For background I am running NiFi in a docker container.
>>> I think I might have to archive the currently running container and
>>> bring the new version up on a new container.
>>>
>>> Thanks,
>>> Eric
>>>
>>


Re: Is provenance data preserved when processors are deleted?

2020-05-04 Thread Mike Thomsen
One way to do it would be to set up a SiteToSiteProvenanceReporting task
and have it send the data to another NiFi instance. That instance can post
all of the provenance data into a NoSQL database like Mongo or
Elasticsearch very quickly.

On Mon, May 4, 2020 at 5:47 PM Eric Secules  wrote:

> Hello everyone,
>
> If I am upgrading a process group to the latest version, do you know
> whether provenance is preserved for processors that may get deleted in the
> upgrade?
> I have noticed that if I delete my process group and redownload it from
> the registry, I am no longer able to see the provenance data from flowfiles
> that went through the first process group.
>
> What is the best way to view and archive provenance data for older
> versions of flows? For background I am running NiFi in a docker container.
> I think I might have to archive the currently running container and bring
> the new version up on a new container.
>
> Thanks,
> Eric
>


Re: How to use delta storage format

2020-04-05 Thread Mike Thomsen
Paul,

> What it does is basically run a local-mode (in-process) Spark to read
> the log.

That's unfortunately not particularly scalable unless I'm missing
something. I think the easiest path to accomplish this would be to build a
NiFi flow that generates Parquet files, uploads them into an S3 bucket, and
then periodically run a Spark job to read the entire bucket and merge the
files into the table. I've done something simple like this in a personal
experiment. It's particularly easy now that we have Record API components
that will directly generate Parquet output files from an Avro schema and a
record set.
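
For what it's worth, the "periodically run a Spark job" half of that idea can
be fairly small. A rough PySpark sketch, with made-up S3 paths and assuming
the Delta Lake jars are available to the Spark session:

# Sketch of the periodic merge job described above. Bucket and prefix names
# are placeholders; the two .config() lines assume the Delta Lake jars are on
# the Spark classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nifi-parquet-to-delta")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Read whatever Parquet files the NiFi flow has landed since the last run...
landed = spark.read.parquet("s3a://my-bucket/landing/")

# ...and append them to the Delta table in a single atomic commit.
landed.write.format("delta").mode("append").save("s3a://my-bucket/tables/events")

spark.stop()

The landing prefix still has to be swept or partitioned per run so the same
files aren't appended twice, and the table compacted now and then, which is
the small-files point discussed below.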

On Tue, Mar 31, 2020 at 11:08 AM Paul Parker  wrote:

> Let me share answers from the delta community:
>
> Answer to Q1:
> Structured streaming queries can do commits every minute, even every 20-30
> seconds. This definitely creates small files. But that is okay, because it
> is expected that people will periodically compact the files. The same
> timing should work fine for NiFi and any other streaming engine. It does
> create 1000-ish versions per day, but that is okay.
>
> Answer to Q2:
> That is up to the sink implementation. Both are okay. In fact, it probably
> can be a combination of both, as long as we don't commit every second. That
> may not scale well.
>
> Answer to Q3:
> You need a primary node which is responsible for managing the Delta table.
> That node would be responsible for reading the log, parsing it, updating
> it, etc. Unfortunately, we have no good non-Spark way to read the log, much
> less write to it.
> There is an experimental uber jar that tries to package Delta + Spark
> together into a single jar, which you could use to read the log. It's
> available here - https://github.com/delta-io/connectors/
>
> What it does is basically run a local-mode (in-process) Spark to read
> the log. This is what we are using to build a Hive connector that will allow
> Hive to read Delta files. Now, the goal of that was only to read; your goal
> is to write, which is definitely more complicated, because for that you have
> to do much more. That uber jar has all the necessary code to do the
> writing, which you could use, but there has to be a driver node which
> collects all the parquet files written by other nodes and atomically
> commits those parquet files to the Delta log to make them visible to all
> readers.
>
> Does the first orientation help?
>
>
> Mike Thomsen  schrieb am So., 29. März 2020,
> 20:29:
>
>> It looks like a lot of their connectors rely on external management by
>> Spark. That is true of Hive, and also of Athena/Presto unless I misread the
>> documentation. Read some of the fine print near the bottom of this to get
>> an idea of what I mean:
>>
>> https://github.com/delta-io/connectors
>>
>> Hypothetically, we could build a connector for NiFi, but there are some
>> things about the design of Delta Lake that I am not sure about based on my
>> own research and what not. Roughly, they are the following:
>>
>> 1. What is a good timing strategy for doing commits to Delta Lake?
>> 2. What would trigger a commit in the first place?
>> 3. Is there a good way to trigger commits that would work within the
>> masterless cluster design of clustered NiFi instead of requiring a special
>> "primary node only" processor for executing commits?
>>
>> Based on my experimentation, one of the biggest questions around the
>> first point is whether you really want potentially thousands or tens of
>> thousands of time shift events to be created throughout the day? A record
>> processor that reads in a ton of small record sets and injects that into
>> the Delta Lake would create a ton of these checkpoints, and they'd be
>> largely meaningless to people trying to make sense of them for the purpose
>> of going back and forth in time between versions.
>>
>> Do we trigger a commit per record set or set a timer?
>>
>> Most of us on the NiFi dev side have no real experience here. It would be
>> helpful for us to get some ideas to form use cases from the community
>> because there are some big gaps on how we'd even start to shape the
>> requirements.
>>
>> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker 
>> wrote:
>>
>>> Hi Mike,
>>> your alternate suggestion sounds good. But how does it work if I want to
>>> keep this running continuously? In other words, the delta table should be
>>> continuously updated. Finally, this is one of the biggest advantages of
>>> Delta: you can ingest batch and streaming data into one table.
>>>
>>> I also think about workarounds (Use Athena, Presto or Redshift with

Re: How to use delta storage format

2020-04-02 Thread Mike Thomsen
Interesting. You've given me a lot to think about in terms of designing
this.

On Thu, Apr 2, 2020 at 10:43 AM Paul Parker  wrote:

> @Mike I'd appreciate some feedback.
>
> Paul Parker  schrieb am Di., 31. März 2020, 17:07:
>
>> Let me share answers from the delta community:
>>
>> Answer to Q1:
>> Structured streaming queries can do commits every minute, even every
>> 20-30 seconds. This definitely creates small files. But that is okay,
>> because it is expected that people will periodically compact the files. The
>> same timing should work fine for NiFi and any other streaming engine. It
>> does create 1000-ish versions per day, but that is okay.
>>
>> Answer to Q2:
>> That is up to the sink implementation. Both are okay. In fact, it
>> probably can be a combination of both, as long as we don't commit every
>> second. That may not scale well.
>>
>> Answer to Q3:
>> You need a primary node which is responsible for managing the Delta
>> table. That node would be responsible for reading the log, parsing it,
>> updating it, etc. Unfortunately, we have no good non-Spark way to read the
>> log, much less write to it.
>> There is an experimental uber jar that tries to package Delta + Spark
>> together into a single jar, which you could use to read the log. It's
>> available here - https://github.com/delta-io/connectors/
>>
>> What it does is basically run a local-mode (in-process) Spark to read
>> the log. This is what we are using to build a Hive connector that will allow
>> Hive to read Delta files. Now, the goal of that was only to read; your goal
>> is to write, which is definitely more complicated, because for that you have
>> to do much more. That uber jar has all the necessary code to do the
>> writing, which you could use, but there has to be a driver node which
>> collects all the parquet files written by other nodes and atomically
>> commits those parquet files to the Delta log to make them visible to all
>> readers.
>>
>> Does the first orientation help?
>>
>>
>> Mike Thomsen  schrieb am So., 29. März 2020,
>> 20:29:
>>
>>> It looks like a lot of their connectors rely on external management by
>>> Spark. That is true of Hive, and also of Athena/Presto unless I misread the
>>> documentation. Read some of the fine print near the bottom of this to get
>>> an idea of what I mean:
>>>
>>> https://github.com/delta-io/connectors
>>>
>>> Hypothetically, we could build a connector for NiFi, but there are some
>>> things about the design of Delta Lake that I am not sure about based on my
>>> own research and what not. Roughly, they are the following:
>>>
>>> 1. What is a good timing strategy for doing commits to Delta Lake?
>>> 2. What would trigger a commit in the first place?
>>> 3. Is there a good way to trigger commits that would work within the
>>> masterless cluster design of clustered NiFi instead of requiring a special
>>> "primary node only" processor for executing commits?
>>>
>>> Based on my experimentation, one of the biggest questions around the
>>> first point is whether you really want potentially thousands or tens of
>>> thousands of time shift events to be created throughout the day? A record
>>> processor that reads in a ton of small record sets and injects that into
>>> the Delta Lake would create a ton of these checkpoints, and they'd be
>>> largely meaningless to people trying to make sense of them for the purpose
>>> of going back and forth in time between versions.
>>>
>>> Do we trigger a commit per record set or set a timer?
>>>
>>> Most of us on the NiFi dev side have no real experience here. It would
>>> be helpful for us to get some ideas to form use cases from the community
>>> because there are some big gaps on how we'd even start to shape the
>>> requirements.
>>>
>>> On Sun, Mar 29, 2020 at 1:28 PM Paul Parker 
>>> wrote:
>>>
>>>> Hi Mike,
>>>> your alternate suggestion sounds good. But how does it work if I want
>>>> to keep this running continuously? In other words, the delta table should
>>>> be continuously updated. Finally, this is one of the biggest advantages of
>>>> Delta: you can ingest batch and streaming data into one table.
>>>>
>>>> I also think about workarounds (Use Athena, Presto or Redshift with
>>>> Nifi):
>>>> "Here is the list of integrations that enable you to access Delta

Re: Performance of adding many keys to redis with PutDistributedMapCache

2020-03-31 Thread Mike Thomsen
Might be worth experimenting with KeyDB to see if that helps. It's a
multi-threaded fork of Redis that is supposedly about as fast on a single
node as a similarly sized Redis cluster, when you compare cluster nodes to
the KeyDB thread pool size.

https://keydb.dev/
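
To make the MSET idea from Bryan's reply below concrete: the win is one round
trip for a whole batch of keys instead of one per key. An actual NiFi
processor doing this would be Java, so this is only a standalone sketch of
the Redis side, with made-up host, port, and key names:

# Standalone illustration of per-key SET vs. batched MSET (not NiFi code).
import redis

r = redis.Redis(host="localhost", port=6379)

entries = {"entry:%d" % i: "value-%d" % i for i in range(100_000)}

# Roughly what one cache put per flowfile looks like: one network round trip
# for every key.
for key, value in entries.items():
    r.set(key, value)

# The same data in far fewer round trips: MSET in chunks of a few thousand.
items = list(entries.items())
for start in range(0, len(items), 5000):
    r.mset(dict(items[start:start + 5000]))

Concurrent tasks parallelize the first pattern; only a different processor
(or a record-based one, as also suggested below) gets you the second.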

On Tue, Mar 31, 2020 at 4:49 PM Bryan Bende  wrote:

> Hi Brian,
>
> I'm not sure what can really be done with the existing processor besides
> what you have already done. Have you configured your overall Timer Driven
> thread pool appropriately?
>
> Most likely there would need to be a new PutRedis processor that didn't
> have to adhere to the DistributedMapCacheInterface and could use MSET or
> whatever specific Redis functionality was needed.
>
> Another option might be a record-based variation of PutDistributedMapCache
> where you could keep thousands of records together and stream them to the
> cache. It would take a record-path to specify the key for each record and
> serialize the record as the value (assuming your data fits into one of the
> record formats like JSON, Avro, CSV).
>
> -Bryan
>
> On Tue, Mar 31, 2020 at 4:23 PM Hesselmann, Brian <
> brian.hesselm...@cgi.com> wrote:
>
>> Hi,
>>
>> We currently run a flow that puts about 700,000 entries/flowfiles into
>> Redis every 5 minutes. I'm looking for ways to improve performance.
>>
>> Currently we've been upping the number of concurrent tasks and run
>> duration of the PutDistributedMapCache processor to be able to process
>> everything. I know Redis supports setting multiple keys at once using MSET
>> (https://redis.io/commands/mset); however, this command is not available
>> from NiFi.
>>
>> Short of simply upgrading the system we run NiFi/Redis on, do you have
>> any suggestions for improving performance of PutDistributedMapCache?
>>
>> Best,
>> Brian
>>
>

