Re: Nifi hardware recommendation

2016-10-13 Thread Joe Witt
Ali,

You have a lot of nice resources to work with there.  I'd personally recommend
the series-of-RAID-1 configuration, provided you keep in mind that it means you
can only lose a single disk for any one partition.  As long as the disks are
being monitored and would be quickly replaced, this works well in practice.  If
there could be lapses in monitoring or in time to replace, then it is perhaps
safer to go with more redundancy or an alternative RAID type.

I'd put the OS, the app installs (with the user and audit DB stuff), and the
application logs on one physical RAID volume.  Have a dedicated physical volume
for the flow file repository.  It will not be able to use all the space, but it
certainly could benefit from having no other contention.  This could actually be
a great place for SSDs.  For the remaining volumes, split them up between
content and provenance as you have.  You get to make the overall performance
versus retention decision.  Frankly, you have a great system to work with and I
suspect you're going to see excellent results anyway.

Conservatively speaking, expect say 50MB/s of throughput per volume in the
content repository, so if you end up with 8 of them you could achieve upwards of
400MB/s sustained.  You'll also then want to make sure you have a good 10G-based
network setup as well.  Or, you could dial back on the speed side of the
tradeoff and simply increase retention or disk-loss tolerance.  Lots of ways to
play the game.

There are no published SSD vs HDD performance benchmarks that I am aware of,
though producing some would be a good idea.  Having a hybrid of SSDs and HDDs
could offer a really solid performance/retention/cost tradeoff.  For example,
having SSDs for the OS/logs/provenance/flowfile with HDDs for the content would
be quite nice.  At that rate, to take full advantage of the system you'd need
very strong network infrastructure between NiFi and any systems it is
interfacing with, and your flows would need to be well tuned for GC/memory
efficiency.

Thanks
Joe

On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian  wrote:

> Dear Nifi Users/ developers,
> Hi,
>
> I was wondering whether there is any benchmark addressing the question of
> whether it is better to dedicate disk control to Nifi or to use RAID for this
> purpose. For example, which of these scenarios is recommended from a
> performance point of view?
> Scenario 1:
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 2 disk- raid 1 for provenance repo1
> 2 disk- raid 1 for provenance repo2
> 2 disk- raid 1 for content repo1
> 2 disk- raid 1 for content repo2
> 2 disk- raid 1 for content repo3
> 2 disk- raid 1 for content repo4
> 2 disk- raid 1 for content repo5
> 2 disk- raid 1 for content repo6
> 2 disk- raid 1 for content repo7
> 2 disk- raid 1 for content repo8
> 2 disk- raid 1 for content repo9
>
>
> Scenario 2:
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 4 disk- raid 10 for provenance repo1
> 18 disk- raid 10 for content repo1
>
> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
> Thank you very much.
>
> Best regards,
> Ali
>


Re: Nifi hardware recommendation

2016-10-13 Thread Ali Nazemian
Dear Joe,

Thank you very much. That was a really great explanation.
I investigated the Nifi architecture, and it seems that most of the read/write
operations for the flow file repo and provenance repo are random, while for the
content repo most of the read/write operations are sequential. Let's say cost
does not matter. In that case, even choosing SSDs for the content repo would not
provide a huge performance gain over HDDs. Am I right? Hence, it would be better
to spend the content-repo SSD money on network infrastructure.

Best regards,
Ali

On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt  wrote:

> Ali,
>
> You have a lot of nice resources to work with there.  I'd recommend the
> series of RAID-1 configuration personally provided you keep in mind this
> means you can only lose a single disk for any one partition.  As long as
> they're being monitored and would be quickly replaced this in practice
> works well.  If there could be lapses in monitoring or time to replace then
> it is perhaps safer to go with more redundancy or an alternative RAID type.
>
> I'd say do the OS, app installs w/user and audit db stuff, application
> logs on one physical RAID volume.  Have a dedicated physical volume for the
> flow file repository.  It will not be able to use all the space but it
> certainly could benefit from having no other contention.  This could be a
> great thing to have SSDs for actually.  And for the remaining volumes split
> them up for content and provenance as you have.  You get to make the
> overall performance versus retention decision.  Frankly, you have a great
> system to work with and I suspect you're going to see excellent results
> anyway.
>
> Conservatively speaking expect say 50MB/s of throughput per volume in the
> content repository so if you end up with 8 of them could achieve upwards of
> 400MB/s sustained.  You'll also then want to make sure you have a good 10G
> based network setup as well.  Or, you could dial back on the speed tradeoff
> and simply increase retention or disk loss tolerance.  Lots of ways to play
> the game.
>
> There are no published SSD vs HDD performance benchmarks that I am aware
> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
> offer a really solid performance/retention/cost tradeoff.  For example
> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
> that would be quite nice.  At that rate to take full advantage of the
> system you'd need to have very strong network infrastructure between NiFi
> and any systems it is interfacing with  and your flows would need to be
> well tuned for GC/memory efficiency.
>
> Thanks
> Joe
>
> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian 
> wrote:
>
>> Dear Nifi Users/ developers,
>> Hi,
>>
>> I was wondering is there any benchmark about the question that is it
>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>> example, which of these scenarios is recommended from the performance point
>> of view?
>> Scenario 1:
>> 24 disk in total
>> 2 disk- raid 1 for OS and fileflow repo
>> 2 disk- raid 1 for provenance repo1
>> 2 disk- raid 1 for provenance repo2
>> 2 disk- raid 1 for content repo1
>> 2 disk- raid 1 for content repo2
>> 2 disk- raid 1 for content repo3
>> 2 disk- raid 1 for content repo4
>> 2 disk- raid 1 for content repo5
>> 2 disk- raid 1 for content repo6
>> 2 disk- raid 1 for content repo7
>> 2 disk- raid 1 for content repo8
>> 2 disk- raid 1 for content repo9
>>
>>
>> Scenario 2:
>> 24 disk in total
>> 2 disk- raid 1 for OS and fileflow repo
>> 4 disk- raid 10 for provenance repo1
>> 18 disk- raid 10 for content repo1
>>
>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>> Thank you very much.
>>
>> Best regards,
>> Ali
>>
>
>


-- 
A.Nazemian


Re: Nifi hardware recommendation

2016-10-13 Thread Joe Witt
Ali,

I agree with your assumption.  It would be great to test that out and provide
some numbers, but intuitively I agree.

I could envision certain scatter/gather data flows that could challenge that
sequential-access assumption, but honestly, with how good disk caching is in
Linux these days, I think that practically speaking this is the right way to
think about it.

Thanks
Joe

On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian  wrote:

> Dear Joe,
>
> Thank you very much. That was a really great explanation.
> I investigated the Nifi architecture, and it seems that most of the
> read/write operations for flow file repo and provenance repo are random.
> However, for content repo most of the read/write operations are sequential.
> Let's say cost does not matter. In this case, even choosing SSD for content
> repo can not provide huge performance gain instead of HDD. Am I right?
> Hence, it would be better to spend content repo SSD money on network
> infrastructure.
>
> Best regards,
> Ali
>
> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt  wrote:
>
>> Ali,
>>
>> You have a lot of nice resources to work with there.  I'd recommend the
>> series of RAID-1 configuration personally provided you keep in mind this
>> means you can only lose a single disk for any one partition.  As long as
>> they're being monitored and would be quickly replaced this in practice
>> works well.  If there could be lapses in monitoring or time to replace then
>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>
>> I'd say do the OS, app installs w/user and audit db stuff, application
>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>> flow file repository.  It will not be able to use all the space but it
>> certainly could benefit from having no other contention.  This could be a
>> great thing to have SSDs for actually.  And for the remaining volumes split
>> them up for content and provenance as you have.  You get to make the
>> overall performance versus retention decision.  Frankly, you have a great
>> system to work with and I suspect you're going to see excellent results
>> anyway.
>>
>> Conservatively speaking expect say 50MB/s of throughput per volume in the
>> content repository so if you end up with 8 of them could achieve upwards of
>> 400MB/s sustained.  You'll also then want to make sure you have a good 10G
>> based network setup as well.  Or, you could dial back on the speed tradeoff
>> and simply increase retention or disk loss tolerance.  Lots of ways to play
>> the game.
>>
>> There are no published SSD vs HDD performance benchmarks that I am aware
>> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
>> offer a really solid performance/retention/cost tradeoff.  For example
>> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
>> that would be quite nice.  At that rate to take full advantage of the
>> system you'd need to have very strong network infrastructure between NiFi
>> and any systems it is interfacing with  and your flows would need to be
>> well tuned for GC/memory efficiency.
>>
>> Thanks
>> Joe
>>
>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian 
>> wrote:
>>
>>> Dear Nifi Users/ developers,
>>> Hi,
>>>
>>> I was wondering is there any benchmark about the question that is it
>>> better to dedicate disk control to Nifi or using RAID for this purpose? For
>>> example, which of these scenarios is recommended from the performance point
>>> of view?
>>> Scenario 1:
>>> 24 disk in total
>>> 2 disk- raid 1 for OS and fileflow repo
>>> 2 disk- raid 1 for provenance repo1
>>> 2 disk- raid 1 for provenance repo2
>>> 2 disk- raid 1 for content repo1
>>> 2 disk- raid 1 for content repo2
>>> 2 disk- raid 1 for content repo3
>>> 2 disk- raid 1 for content repo4
>>> 2 disk- raid 1 for content repo5
>>> 2 disk- raid 1 for content repo6
>>> 2 disk- raid 1 for content repo7
>>> 2 disk- raid 1 for content repo8
>>> 2 disk- raid 1 for content repo9
>>>
>>>
>>> Scenario 2:
>>> 24 disk in total
>>> 2 disk- raid 1 for OS and fileflow repo
>>> 4 disk- raid 10 for provenance repo1
>>> 18 disk- raid 10 for content repo1
>>>
>>> Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
>>> Thank you very much.
>>>
>>> Best regards,
>>> Ali
>>>
>>
>>
>
>
> --
> A.Nazemian
>


Re: Nifi hardware recommendation

2016-10-13 Thread Ali Nazemian
Thank you very much.
I would be more than happy to provide some benchmark results after the
implementation.
Sincerely yours,
Ali

On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt  wrote:

> Ali,
>
> I agree with your assumption.  It would be great to test that out and
> provide some numbers but intuitively I agree.
>
> I could envision certain scatter/gather data flows that could challenge
> that sequential access assumption but honestly with how awesome disk
> caching is in Linux these days in think practically speaking this is the
> right way to think about it.
>
> Thanks
> Joe
>
> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian 
> wrote:
>
>> Dear Joe,
>>
>> Thank you very much. That was a really great explanation.
>> I investigated the Nifi architecture, and it seems that most of the
>> read/write operations for flow file repo and provenance repo are random.
>> However, for content repo most of the read/write operations are sequential.
>> Let's say cost does not matter. In this case, even choosing SSD for content
>> repo can not provide huge performance gain instead of HDD. Am I right?
>> Hence, it would be better to spend content repo SSD money on network
>> infrastructure.
>>
>> Best regards,
>> Ali
>>
>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt  wrote:
>>
>>> Ali,
>>>
>>> You have a lot of nice resources to work with there.  I'd recommend the
>>> series of RAID-1 configuration personally provided you keep in mind this
>>> means you can only lose a single disk for any one partition.  As long as
>>> they're being monitored and would be quickly replaced this in practice
>>> works well.  If there could be lapses in monitoring or time to replace then
>>> it is perhaps safer to go with more redundancy or an alternative RAID type.
>>>
>>> I'd say do the OS, app installs w/user and audit db stuff, application
>>> logs on one physical RAID volume.  Have a dedicated physical volume for the
>>> flow file repository.  It will not be able to use all the space but it
>>> certainly could benefit from having no other contention.  This could be a
>>> great thing to have SSDs for actually.  And for the remaining volumes split
>>> them up for content and provenance as you have.  You get to make the
>>> overall performance versus retention decision.  Frankly, you have a great
>>> system to work with and I suspect you're going to see excellent results
>>> anyway.
>>>
>>> Conservatively speaking expect say 50MB/s of throughput per volume in
>>> the content repository so if you end up with 8 of them could achieve
>>> upwards of 400MB/s sustained.  You'll also then want to make sure you have
>>> a good 10G based network setup as well.  Or, you could dial back on the
>>> speed tradeoff and simply increase retention or disk loss tolerance.  Lots
>>> of ways to play the game.
>>>
>>> There are no published SSD vs HDD performance benchmarks that I am aware
>>> of though this is a good idea.  Having a hybrid of SSDs and HDDs could
>>> offer a really solid performance/retention/cost tradeoff.  For example
>>> having SSDs for the OS/logs/provenance/flowfile with HDDs for the content -
>>> that would be quite nice.  At that rate to take full advantage of the
>>> system you'd need to have very strong network infrastructure between NiFi
>>> and any systems it is interfacing with  and your flows would need to be
>>> well tuned for GC/memory efficiency.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian 
>>> wrote:
>>>
 Dear Nifi Users/ developers,
 Hi,

 I was wondering is there any benchmark about the question that is it
 better to dedicate disk control to Nifi or using RAID for this purpose? For
 example, which of these scenarios is recommended from the performance point
 of view?
 Scenario 1:
 24 disk in total
 2 disk- raid 1 for OS and fileflow repo
 2 disk- raid 1 for provenance repo1
 2 disk- raid 1 for provenance repo2
 2 disk- raid 1 for content repo1
 2 disk- raid 1 for content repo2
 2 disk- raid 1 for content repo3
 2 disk- raid 1 for content repo4
 2 disk- raid 1 for content repo5
 2 disk- raid 1 for content repo6
 2 disk- raid 1 for content repo7
 2 disk- raid 1 for content repo8
 2 disk- raid 1 for content repo9


 Scenario 2:
 24 disk in total
 2 disk- raid 1 for OS and fileflow repo
 4 disk- raid 10 for provenance repo1
 18 disk- raid 10 for content repo1

 Moreover, is there any benchmark for SSD vs HDD performance for Nifi?
 Thank you very much.

 Best regards,
 Ali

>>>
>>>
>>
>>
>> --
>> A.Nazemian
>>
>
>


-- 
A.Nazemian


Re: Rest API Client swagger.json

2016-10-13 Thread Matt Gilman
Stephane,

Yes, you are correct that Apache NiFi uses swagger. However, we are only
using it for keeping the documentation in sync. We use a maven plugin that
inspects the swagger annotations and generates a swagger.json. The
swagger.json is generated to nifi-web-api/target/swagger-ui/swagger.json at
build time. Subsequently, the swagger.json is run through a handlebars
template to generate the REST API docs.
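
If you want to produce the file yourself, it falls out of a normal build; a
minimal sketch (standard Maven commands run from the NiFi source root; the exact
module path can vary by version):

    mvn clean install -DskipTests
    find . -path '*nifi-web-api/target/swagger-ui/swagger.json'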

We provide a client library at


<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-client-dto</artifactId>
    <version>1.0.0</version>
</dependency>


Examples of its usage can be seen in our access control integration tests
[1].

Let me know if you have any other questions. Thanks!

Matt

[1]
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/test/java/org/apache/nifi/integration/accesscontrol/ITProcessorAccessControl.java

On Wed, Oct 12, 2016 at 10:53 PM, Stéphane Maarek  wrote:

> Hi,
>
> It seems possible to create an API client for any language using this
> project:
> https://github.com/swagger-api/swagger-codegen
>
> It needs the swagger.json file. I know it should be generated at build
> time, but where can I find it?
>
> Beyond that, would it be useful to extract that file, version control it,
> and maybe automatically generate API sdks for many languages using the
> project above? Would help tremendously
>
> Cheers
> Stephane
>


NiFi for backup solution

2016-10-13 Thread Gop Krr
Hi All,
I am learning NiFi as well as trying to deploy it in production for a few use
cases. One of the use cases is ETL, and another is using NiFi as a backup
solution, where it takes data from one source and moves it to another
database or file. Is anyone using NiFi for this purpose? Does NiFi support
incremental data movement?
It would be awesome if someone could point me to the right documentation.
Thanks
Rai


Re: NiFi for backup solution

2016-10-13 Thread Joe Witt
Rai,

NiFi can certainly be used for some data replication scenarios and
quite often is.  If you can treat the source like a continuous data
source, meaning there is some way to keep state about what has been pulled
already and what has changed or has yet to be pulled, and NiFi can just
keep running, then generally speaking it will work out well.  Depending
on how the flow is set up, error conditions that can occur in remote
delivery, and cluster topology, NiFi won't be able to ensure that the order
in which data is received is the order in which data is delivered.  So, if
you need to ensure data is copied in precisely the same order (like
log replication) and each object/message/event is on the order of KBs
in size, then I'd recommend looking at Apache Kafka and Kafka Connect's
support for keeping things ordered within the same partition of the
same topic.

Thanks
Joe

On Thu, Oct 13, 2016 at 11:05 AM, Gop Krr  wrote:
> Hi All,
> I am learning NiFi as well as trying to deploy it in production for few  use
> cases. One of the use case is ETL and another use case is, using NiFi as a
> backup solution, where it takes the data from one source and moves to
> another database|file. Is anyone using NiFi for this purpose? Does NiFi
> support incremental data move?
> It would be awesome if someone can point me to right documentation.
> Thanks
> Rai


Re: NiFi for backup solution

2016-10-13 Thread Matt Burgess
Rai,

There are incremental data movement processors in NiFi depending on
your source/target. For example, if your sources are files, you can
use ListFile in combination with FetchFile, the former will keep track
of which files it has found thus far, so if you put new files into the
location (or update existing ones), only those new/updated files will
be processed the next time.

For database (RDBMS) sources, there are the QueryDatabaseTable and
GenerateTableFetch processors, which support the idea of "maximum
value columns". For each of those columns, the processor keeps track of the
maximum value observed in that column; on future executions, only rows whose
values in those columns exceed the currently-observed maximum are retrieved,
then the maximum is updated, and so forth.

The Usage documentation for these processors can be found at
https://nifi.apache.org/docs.html (left-hand side under Processors).

Regards,
Matt

On Thu, Oct 13, 2016 at 11:05 AM, Gop Krr  wrote:
> Hi All,
> I am learning NiFi as well as trying to deploy it in production for few  use
> cases. One of the use case is ETL and another use case is, using NiFi as a
> backup solution, where it takes the data from one source and moves to
> another database|file. Is anyone using NiFi for this purpose? Does NiFi
> support incremental data move?
> It would be awesome if someone can point me to right documentation.
> Thanks
> Rai


Re: NiFi for backup solution

2016-10-13 Thread Gop Krr
Thanks Joe and Matt.
@Joe, based on your comment, I would need to use NiFi as a producer which puts
the data on a Kafka queue and then have a NiFi consumer which writes the data
to the destination. Is my understanding correct?

@Matt, my use case is for DynamoDB. I will look into whether
incremental copy is supported for DynamoDB.
Thanks again; it felt so good to see such a vibrant community. I got my
questions answered within five minutes. Kudos to the NiFi community.

On Thu, Oct 13, 2016 at 8:17 AM, Matt Burgess  wrote:

> Rai,
>
> There are incremental data movement processors in NiFi depending on
> your source/target. For example, if your sources are files, you can
> use ListFile in combination with FetchFile, the former will keep track
> of which files it has found thus far, so if you put new files into the
> location (or update existing ones), only those new/updated files will
> be processed the next time.
>
> For database (RDBMS) sources, there are the QueryDatabaseTable and
> GenerateTableFetch processors, which support the idea of "maximum
> value columns", such that for each of said columns, the processor(s)
> will keep track of the maximum value observed in that column, then for
> future executions of the processor, only rows whose values in those
> columns exceed the currently-observed maximum will be retrieved, then
> the maximum will be updated, and so forth.
>
> The Usage documentation for these processors can be found at
> https://nifi.apache.org/docs.html (left-hand side under Processors).
>
> Regards,
> Matt
>
> On Thu, Oct 13, 2016 at 11:05 AM, Gop Krr  wrote:
> > Hi All,
> > I am learning NiFi as well as trying to deploy it in production for few
> use
> > cases. One of the use case is ETL and another use case is, using NiFi as
> a
> > backup solution, where it takes the data from one source and moves to
> > another database|file. Is anyone using NiFi for this purpose? Does NiFi
> > support incremental data move?
> > It would be awesome if someone can point me to right documentation.
> > Thanks
> > Rai
>


Re: NiFi for backup solution

2016-10-13 Thread Joe Witt
You'd only need to do that if you have strict ordering requirements, like
reading directly from a transaction log and replicating it.  If yes, I'd
skip NiFi unless you're also doing other cases with it.

Sounds like Matt's path gets you going though, so that might work out just
fine.

Thanks
Joe

On Oct 13, 2016 11:25 AM, "Gop Krr"  wrote:

> Thanks Joe and Matt.
> @Joe, based on your comment, I need to use NiFi as a producer which puts
> the data on Kafka queue and then have NiFi consumer, which writes the data
> back to the destination. Is my understanding correct?
>
> @Matt, My use case is for the DynamoDB. I will look into whether
> incremental copy is supported for Dynamodb.
> Thanks again and felt so good to see the vibrant community. I got my
> questions answered within five minutes. Kudos to NiFi community.
>
> On Thu, Oct 13, 2016 at 8:17 AM, Matt Burgess 
> wrote:
>
>> Rai,
>>
>> There are incremental data movement processors in NiFi depending on
>> your source/target. For example, if your sources are files, you can
>> use ListFile in combination with FetchFile, the former will keep track
>> of which files it has found thus far, so if you put new files into the
>> location (or update existing ones), only those new/updated files will
>> be processed the next time.
>>
>> For database (RDBMS) sources, there are the QueryDatabaseTable and
>> GenerateTableFetch processors, which support the idea of "maximum
>> value columns", such that for each of said columns, the processor(s)
>> will keep track of the maximum value observed in that column, then for
>> future executions of the processor, only rows whose values in those
>> columns exceed the currently-observed maximum will be retrieved, then
>> the maximum will be updated, and so forth.
>>
>> The Usage documentation for these processors can be found at
>> https://nifi.apache.org/docs.html (left-hand side under Processors).
>>
>> Regards,
>> Matt
>>
>> On Thu, Oct 13, 2016 at 11:05 AM, Gop Krr  wrote:
>> > Hi All,
>> > I am learning NiFi as well as trying to deploy it in production for
>> few  use
>> > cases. One of the use case is ETL and another use case is, using NiFi
>> as a
>> > backup solution, where it takes the data from one source and moves to
>> > another database|file. Is anyone using NiFi for this purpose? Does NiFi
>> > support incremental data move?
>> > It would be awesome if someone can point me to right documentation.
>> > Thanks
>> > Rai
>>
>
>


Book/Training for NiFi

2016-10-13 Thread Gop Krr
Hi All,
Is there any book for Apache NiFi?
Also, does Hortonworks conduct training for NiFi?
Thanks
Rai


Re: Book/Training for NiFi

2016-10-13 Thread Andy LoPresto
Hi Rai,

There are some excellent documents on the Apache NiFi site [1] to help you 
learn. There is an Administrator Guide [2], a User Guide [3], a Developer Guide 
[4], a NiFi In-Depth document [5], an Expression Language Guide [6] and 
processor and component documentation [7] as well. Currently, we are unaware of 
any official “book” resources.

Any corporate offerings are separate from the Apache project and should be 
investigated with said company.

[1] https://nifi.apache.org/ 
[2] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
[3] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
[4] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
[5] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
[6] https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[7] https://nifi.apache.org/docs.html 

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 13, 2016, at 10:28 AM, Gop Krr  wrote:
> 
> Hi All,
> Is there any book for apache NiFi?
> Also, does Hortonworks conducts training for NiFi?
> Thanks
> Rai





Re: Book/Training for NiFi

2016-10-13 Thread Gop Krr
Thanks Andy. Appreciate your guidance.

On Thu, Oct 13, 2016 at 10:39 AM, Andy LoPresto 
wrote:

> Hi Rai,
>
> There are some excellent documents on the Apache NiFi site [1] to help you
> learn. There is an Administrator Guide [2], a User Guide [3], a Developer
> Guide [4], a NiFi In-Depth document [5], an Expression Language Guide [6]
> and processor and component documentation [7] as well. Currently, we are
> unaware of any official “book” resources.
>
> Any corporate offerings are separate from the Apache project and should be
> investigated with said company.
>
> [1] https://nifi.apache.org/
> [2] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html
> [3] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
> [4] https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
> [5] https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html
> [6] https://nifi.apache.org/docs/nifi-docs/html/
> expression-language-guide.html
> [7] https://nifi.apache.org/docs.html
>
> Andy LoPresto
> alopre...@apache.org
> *alopresto.apa...@gmail.com *
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Oct 13, 2016, at 10:28 AM, Gop Krr  wrote:
>
> Hi All,
> Is there any book for apache NiFi?
> Also, does Hortonworks conducts training for NiFi?
> Thanks
> Rai
>
>
>


PutDynamoDB processor

2016-10-13 Thread Gop Krr
Hi All,
I have been trying to use the get and put processors for DynamoDB and I am
almost there. I am able to run the get processor and I can see data flowing
:)
But I see the following error in my nifi-app.log file:

2016-10-13 18:02:38,823 ERROR [Timer-Driven Process Thread-9]
o.a.n.p.aws.dynamodb.GetDynamoDB
GetDynamoDB[id=7d906337-0157-1000-5868-479d0e0e3580] Hash key value '' is
required for flow file
StandardFlowFileRecord[uuid=44554c23-1618-47db-b46e-04ffd737748e,claim=StandardContentClaim
[resourceClaim=StandardResourceClaim[id=1476381755460-37287,
container=default, section=423], offset=0,
length=1048576],offset=0,name=2503473718684086,size=1048576]


I understand that it's looking for the Hash Key Value, but I am not sure
how to pass it.  In the settings tab, NiFi automatically populates
this: ${dynamodb.item.hash.key.value}, but it looks like this is not the
right way to do it. Can I get some guidance on this? Thanks for all the help.

Best,

Rai


GetKafka maximum fetch size

2016-10-13 Thread Igor Kravzov
Hi,

I am getting the following exception in nifi-0.6.1:

kafka.common.MessageSizeTooLargeException: Found a message larger than the
maximum fetch size of this consumer. Increase the fetch size, or decrease
the maximum message size the broker will allow.

What is the max size? How can I increase max size?

Thanks in advance.


Re: PutDynamoDB processor

2016-10-13 Thread James Wing
Rai,

The GetDynamoDB processor requires a hash key value to look up an item in
the table.  The default setting is an Expression Language statement that
reads the hash key value from a flowfile attribute,
dynamodb.item.hash.key.value.  But this is not required.  You can change it
to any attribute expression ${my.hash.key}, or even hard-code a single key
"item123" if you wish.

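For example, a hypothetical UpdateAttribute processor placed upstream of
GetDynamoDB could populate the attribute that the default expression reads
(the value shown is just an illustration):

    dynamodb.item.hash.key.value = ${filename}

With that attribute set, the Hash Key Value property can stay at its default of
${dynamodb.item.hash.key.value}.
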
Does that help?

Thanks,

James

On Thu, Oct 13, 2016 at 12:17 PM, Gop Krr  wrote:

> Hi All,
> I have been trying to use get and load processor for the dynamodb and I am
> almost there. I am able to run the get processor and I see, data is flowing
> :)
> But I see the following error in my nifi-app.log file:
>
> 2016-10-13 18:02:38,823 ERROR [Timer-Driven Process Thread-9]
> o.a.n.p.aws.dynamodb.GetDynamoDB 
> GetDynamoDB[id=7d906337-0157-1000-5868-479d0e0e3580]
> Hash key value '' is required for flow file StandardFlowFileRecord[uuid=
> 44554c23-1618-47db-b46e-04ffd737748e,claim=StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=1476381755460-37287,
> container=default, section=423], offset=0, length=1048576],offset=0,name=
> 2503473718684086,size=1048576]
>
>
> I understand that, its looking for the Hash Key Value but I am not sure,
> how do I pass it.  In the setting tab, nifi automatically populates
> this: ${dynamodb.item.hash.key.value} but looks like this is not the
> right way to do it. Can I get some guidance on this? Thanks for all the
> help.
>
> Best,
>
> Rai
>


Re: PutDynamoDB processor

2016-10-13 Thread Gop Krr
Thanks James. I am looking to iterate through the table so that it takes
hash key values one by one. Do I achieve that through the Expression
Language? If I write a script to do that, how do I pass it to my processor?
Thanks
Niraj

On Thu, Oct 13, 2016 at 1:42 PM, James Wing  wrote:

> Rai,
>
> The GetDynamoDB processor requires a hash key value to look up an item in
> the table.  The default setting is an Expression Language statement that
> reads the hash key value from a flowfile attribute,
> dynamodb.item.hash.key.value.  But this is not required.  You can change it
> to any attribute expression ${my.hash.key}, or even hard-code a single key
> "item123" if you wish.
>
> Does that help?
>
> Thanks,
>
> James
>
> On Thu, Oct 13, 2016 at 12:17 PM, Gop Krr  wrote:
>
>> Hi All,
>> I have been trying to use get and load processor for the dynamodb and I
>> am almost there. I am able to run the get processor and I see, data is
>> flowing :)
>> But I see the following error in my nifi-app.log file:
>>
>> 2016-10-13 18:02:38,823 ERROR [Timer-Driven Process Thread-9]
>> o.a.n.p.aws.dynamodb.GetDynamoDB 
>> GetDynamoDB[id=7d906337-0157-1000-5868-479d0e0e3580]
>> Hash key value '' is required for flow file StandardFlowFileRecord[uuid=44
>> 554c23-1618-47db-b46e-04ffd737748e,claim=StandardContentClaim
>> [resourceClaim=StandardResourceClaim[id=1476381755460-37287,
>> container=default, section=423], offset=0, length=1048576],offset=0,name=
>> 2503473718684086,size=1048576]
>>
>>
>> I understand that, its looking for the Hash Key Value but I am not sure,
>> how do I pass it.  In the setting tab, nifi automatically populates
>> this: ${dynamodb.item.hash.key.value} but looks like this is not the
>> right way to do it. Can I get some guidance on this? Thanks for all the
>> help.
>>
>> Best,
>>
>> Rai
>>
>
>


Re: GetKafka maximum fetch size

2016-10-13 Thread Jeremy Farbota
Igor,

Kafka consumer properties can be found here:
http://kafka.apache.org/documentation.html#consumerconfigs

GetKafka uses the old consumer so the consumer property is:
fetch.message.max.bytes

The default for that property is ~1M.

If possible, you should limit the replica.fetch.max.bytes on the broker
configs to avoid this error (reset it to the default of 1048576).

AFAIK, you'll have to create a custom version of GetKafka that enables
adjustments to fetch.message.max.bytes.
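
For reference, the knobs involved look roughly like this (property names per the
Kafka documentation; the values here are only illustrative):

    # broker side, server.properties: keep messages no larger than consumers can fetch
    message.max.bytes=1000000
    replica.fetch.max.bytes=1048576

    # old high-level consumer property that GetKafka relies on (~1 MB by default)
    fetch.message.max.bytes=1048576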



On Thu, Oct 13, 2016 at 1:22 PM, Igor Kravzov 
wrote:

> Hi,
>
> I am getting the following exception in nifi-0.6.1:
>
> kafka.common.MessageSizeTooLargeException: Found a message larger than
> the maximum fetch size of this consumer. Increase the fetch size, or
> decrease the maximum message size the broker will allow.
>
> What is the max size? How can I increase max size?
>
> Thanks in advance.
>



-- 


Jeremy Farbota
Software Engineer, Data
jfarb...@payoff.com  • (217) 898-8110 <(949)+430-0630>




Re: Rest API Client swagger.json

2016-10-13 Thread Stéphane Maarek
Hi,

Thanks, it helps! Good to know there is already a Java client I could use.
Nonetheless, I think it would be extremely nice to use the swagger-codegen
project to additionally generate SDKs; I don't mind creating a GitHub
project of my own to maintain these.

I gave it a go and it gave me a bunch of errors, see
https://github.com/swagger-api/swagger-codegen/issues/3976

I went to https://editor.swagger.io/ , uploaded the swagger.json file and
apparently the swagger.json specs for many (if not all) delete calls have
wrong parameter definitions, see below. Do you think that's worth opening a JIRA?


Swagger Error

Not a valid parameter definition
Jump to line 344
Details
 Object
code: "ONE_OF_MISSING"
 params: Array [0]
message: "Not a valid parameter definition"
 path: Array [5]
0: "paths"
1: "/connections/{id}"
2: "delete"
3: "parameters"
4: "0"
schemaId: "http://swagger.io/v2/schema.json#";
 inner: Array [2]
 0: Object
code: "ONE_OF_MISSING"
 params: Array [0]
message: "Data does not match any schemas from 'oneOf'"
 path: Array [5]
 inner: Array [2]
 0: Object
code: "OBJECT_MISSING_REQUIRED_PROPERTY"
 params: Array [1]
0: "schema"
message: "Missing required property: schema"
 path: Array [0]
 1: Object
code: "ONE_OF_MISSING"
 params: Array [0]
message: "Data does not match any schemas from 'oneOf'"
 path: Array [0]
 inner: Array [4]
 0: Object
code: "ENUM_MISMATCH"
 params: Array [1]
message: "No enum match for: ref"
 path: Array [1]
 1: Object
code: "ENUM_MISMATCH"
 params: Array [1]
message: "No enum match for: ref"
 path: Array [1]
 2: Object
code: "ENUM_MISMATCH"
 params: Array [1]
message: "No enum match for: ref"
 path: Array [1]
 3: Object
code: "ENUM_MISMATCH"
 params: Array [1]
0: "ref"
message: "No enum match for: ref"
 path: Array [1]
0: "type"
 1: Object
code: "OBJECT_MISSING_REQUIRED_PROPERTY"
 params: Array [1]
0: "$ref"
message: "Missing required property: $ref"
 path: Array [5]
0: "paths"
1: "/connections/{id}"
2: "delete"
3: "parameters"
4: "0"
level: 900
type: "Swagger Error"
description: "Not a valid parameter definition"
lineNumber: 344

On Thu, Oct 13, 2016 at 11:43 PM Matt Gilman 
wrote:

Stephane,

Yes, you are correct that Apache NiFi uses swagger. However, we are only
using it for keeping the documentation in sync. We use a maven plugin that
inspects the swagger annotations and generates a swagger.json. The
swagger.json is generated to nifi-web-api/target/swagger-ui/swagger.json at
build time. Subsequently, the swagger.json is run through a handlebars
template to generate the REST API docs.

We provide a client library at


org.apache.nifi
nifi-client-dto
1.0.0


Examples of its usage can be seen in our access control integration tests
[1].

Let me know if you have any other questions. Thanks!

Matt

[1]
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/test/java/org/apache/nifi/integration/accesscontrol/ITProcessorAccessControl.java

On Wed, Oct 12, 2016 at 10:53 PM, Stéphane Maarek  wrote:

Hi,

It seems possible to create an API client for any language using this
project:
https://github.com/swagger-api/swagger-codegen

It needs the swagger.json file. I know it should be generated at build
time, but where can I find it?

Beyond that, would it be useful to extract that file, version control it,
and maybe automatically generate API sdks for many languages using the
project above? Would help tremendously

Cheers
Stephane


Re: Best practices for pushing to production

2016-10-13 Thread Andy LoPresto
Hi Stéphane,

This is a request that has grown popular recently. NiFi was not initially 
designed with environment promotion in mind, so it is something we are 
currently investigating and trying to address.

The development/QA/production environment promotion process [1] (sometimes 
referred to as "SDLC" or "D2P" in conversation) is a topic of much discussion 
amongst the NiFi development community. Currently, there are plans to improve 
this process in a future release. For now, I will discuss some common 
behaviors/workflows that we have seen.

* The $NIFI_HOME/conf/flow.xml.gz file contains the entire flow serialized to 
XML. This file contains all processor configuration values, even sensitive 
values (encrypted). With the new Variable Registry [2] effort, you can refer to 
environment-specific variables transparently, and promote the same flow between 
environments without having to update specific values in the flow itself.
* The XML flow definition or specific templates can be committed and versioned 
using Git (or any other source code control tool). Recent improvements like 
"deterministic template diffs” [3] have made this versioning easier.
* The NiFi REST API [4] can be used to "automate" the deployment of a template 
or flow to an instance of NiFi (a minimal sketch follows after this list).
* A script (Groovy, Python, etc.) could be used to integrate with both your 
source code control tool and your various NiFi instances to semi-automate this 
process (i.e. tap into Git hooks detecting a commit, and promote automatically 
to the next environment), but you probably want some human interaction to 
remain for verification at each state.
* To ensure that no “data is left hanging”, you can stop your input sources and 
allow the system to “drain” (i.e. check that all queues are empty before 
stopping the application, if replacing the flow.xml.gz file). If adding a new 
template, you can add this during normal operation and then configure and start 
the new flow while the old continues to run.
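
As a concrete illustration of the REST API point above, here is a minimal Python
sketch, assuming an unsecured NiFi 1.x instance and a template XML previously
exported from the dev environment (the host name and file name below are made up):

    import requests

    NIFI = "http://nifi-prod:8080/nifi-api"  # hypothetical production instance

    # Look up the id of the root process group.
    root = requests.get(NIFI + "/flow/process-groups/root").json()
    root_id = root["processGroupFlow"]["id"]

    # Upload the template that was exported from dev.
    with open("my_flow_template.xml", "rb") as f:
        resp = requests.post(NIFI + "/process-groups/" + root_id + "/templates/upload",
                             files={"template": f})
    resp.raise_for_status()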

We understand that the current state of NiFi is not ideal for the promotion of 
the flow between dev/QA/prod environments. There are ongoing efforts to improve 
this, but I can't describe anything concrete at this time. If these points 
raise specific questions or you think of something else, please follow up.

[1] 
https://cwiki.apache.org/confluence/display/NIFI/Configuration+Management+of+Flows
[2] https://cwiki.apache.org/confluence/display/NIFI/Variable+Registry 

[3] https://issues.apache.org/jira/browse/NIFI-826 

[4] https://nifi.apache.org/docs/nifi-docs/rest-api/index.html


Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 11, 2016, at 7:37 PM, Stéphane Maarek  
> wrote:
> 
> Hi,
> 
> How should we proceed to promote flow from dev to test to prod?
> Basically, our stakeholders are not happy with having people modifying 
> anything in production (so it's a read-only environment), and basically want 
> people to draft / edit their flow in dev, push them to test, and then to prod
> 
> I think this has a couple of challenges, such as :
> 
> 0) How do you capture all the changes that were made? Is it a template, or a 
> list of API calls, or something else?
> 1) How do you programatically promote stuff from dev to test to prod? Stuff 
> that change between these environments are probably hostnames, etc
> 2) How do you edit a flow that's already been promoted to prod?
> 3) How to ensure that no data is left "hanging" in between processors before 
> a change to the flow? Or is it fine?
> 
> Interested in knowing how other companies do it, or manage their NiFi cluster 
> in production (what can be changed vs what can't)
> 
> Cheers,
> Stephane





Re: Rest API Client swagger.json

2016-10-13 Thread Stéphane Maarek
Investigated some more, opened a JIRA issue, and closed it via
https://github.com/apache/nifi/pull/1135

On Fri, Oct 14, 2016 at 9:47 AM Stéphane Maarek 
wrote:

> Hi,
>
> Thanks it helps ! Good to know there is already a java client I could use.
> Nonetheless I think it would be extremely nice to use the swagger codegen
> project to generate additionally sdks, I don't mind creating a github
> project of my own to maintain these.
>
> I gave it a go and it gave me a bunch of errors, see
> https://github.com/swagger-api/swagger-codegen/issues/3976
>
> I went to https://editor.swagger.io/ , uploaded the swagger.json file and
> apparently the swagger.json specs for many (if not all) delete calls are
> having wrong specs, see below. Do you think that's worth opening a JIRA?
>
>
> Swagger Error
>
> Not a valid parameter definition
> Jump to line 344
> Details
>  Object
> code: "ONE_OF_MISSING"
>  params: Array [0]
> message: "Not a valid parameter definition"
>  path: Array [5]
> 0: "paths"
> 1: "/connections/{id}"
> 2: "delete"
> 3: "parameters"
> 4: "0"
> schemaId: "http://swagger.io/v2/schema.json#";
>  inner: Array [2]
>  0: Object
> code: "ONE_OF_MISSING"
>  params: Array [0]
> message: "Data does not match any schemas from 'oneOf'"
>  path: Array [5]
>  inner: Array [2]
>  0: Object
> code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>  params: Array [1]
> 0: "schema"
> message: "Missing required property: schema"
>  path: Array [0]
>  1: Object
> code: "ONE_OF_MISSING"
>  params: Array [0]
> message: "Data does not match any schemas from 'oneOf'"
>  path: Array [0]
>  inner: Array [4]
>  0: Object
> code: "ENUM_MISMATCH"
>  params: Array [1]
> message: "No enum match for: ref"
>  path: Array [1]
>  1: Object
> code: "ENUM_MISMATCH"
>  params: Array [1]
> message: "No enum match for: ref"
>  path: Array [1]
>  2: Object
> code: "ENUM_MISMATCH"
>  params: Array [1]
> message: "No enum match for: ref"
>  path: Array [1]
>  3: Object
> code: "ENUM_MISMATCH"
>  params: Array [1]
> 0: "ref"
> message: "No enum match for: ref"
>  path: Array [1]
> 0: "type"
>  1: Object
> code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>  params: Array [1]
> 0: "$ref"
> message: "Missing required property: $ref"
>  path: Array [5]
> 0: "paths"
> 1: "/connections/{id}"
> 2: "delete"
> 3: "parameters"
> 4: "0"
> level: 900
> type: "Swagger Error"
> description: "Not a valid parameter definition"
> lineNumber: 344
>
> On Thu, Oct 13, 2016 at 11:43 PM Matt Gilman 
> wrote:
>
> Stephane,
>
> Yes, you are correct that Apache NiFi uses swagger. However, we are only
> using it for keeping the documentation in sync. We use a maven plugin that
> inspects the swagger annotations and generates a swagger.json. The
> swagger.json is generated to nifi-web-api/target/swagger-ui/swagger.json at
> build time. Subsequently, the swagger.json is run through a handlebars
> template to generate the REST API docs.
>
> We provide a client library at
>
> 
> org.apache.nifi
> nifi-client-dto
> 1.0.0
> 
>
> Examples of its usage can be seen in our access control integration tests
> [1].
>
> Let me know if you have any other questions. Thanks!
>
> Matt
>
> [1]
> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-web-api/src/test/java/org/apache/nifi/integration/accesscontrol/ITProcessorAccessControl.java
>
> On Wed, Oct 12, 2016 at 10:53 PM, Stéphane Maarek <
> stephane.maa...@gmail.com> wrote:
>
> Hi,
>
> It seems possible to create an API client for any language using this
> project:
> https://github.com/swagger-api/swagger-codegen
>
> It needs the swagger.json file. I know it should be generated at build
> time, but where can I find it?
>
> Beyond that, would it be useful to extract that file, version control it,
> and maybe automatically generate API sdks for many languages using the
> project above? Would help tremendously
>
> Cheers
> Stephane
>
>
>


Re: Nifi hardware recommendation

2016-10-13 Thread Ali Nazemian
Hi,

I have another question regarding the hardware recommendation. As far as I
can tell, Nifi currently uses on-heap memory and does not try to load whole
objects into memory. From a garbage-collection perspective, it is not
recommended to dedicate more than 8-10 GB to JVM heap space. In that case,
may I say that spending money on system memory is useless? Probably 16 GB
per system is enough for this architecture, unless architecture changes
appear in the future to use off-heap memory as well. However, I found some
articles about best practices whose memory recommendations do not make sense
to me in that light. Would you please clarify this part for me?
Thank you very much.
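
For context, the heap NiFi gets is whatever is configured in conf/bootstrap.conf;
a minimal sketch with purely illustrative sizes:

    # conf/bootstrap.conf (excerpt)
    java.arg.2=-Xms8g
    java.arg.3=-Xmx8g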

Best regards,
Ali


On Thu, Oct 13, 2016 at 11:38 PM, Ali Nazemian 
wrote:

> Thank you very much.
> I would be more than happy to provide some benchmark results after the
> implementation.
> Sincerely yours,
> Ali
>
> On Thu, Oct 13, 2016 at 11:32 PM, Joe Witt  wrote:
>
>> Ali,
>>
>> I agree with your assumption.  It would be great to test that out and
>> provide some numbers but intuitively I agree.
>>
>> I could envision certain scatter/gather data flows that could challenge
>> that sequential access assumption but honestly with how awesome disk
>> caching is in Linux these days in think practically speaking this is the
>> right way to think about it.
>>
>> Thanks
>> Joe
>>
>> On Thu, Oct 13, 2016 at 8:29 AM, Ali Nazemian 
>> wrote:
>>
>>> Dear Joe,
>>>
>>> Thank you very much. That was a really great explanation.
>>> I investigated the Nifi architecture, and it seems that most of the
>>> read/write operations for flow file repo and provenance repo are random.
>>> However, for content repo most of the read/write operations are sequential.
>>> Let's say cost does not matter. In this case, even choosing SSD for content
>>> repo can not provide huge performance gain instead of HDD. Am I right?
>>> Hence, it would be better to spend content repo SSD money on network
>>> infrastructure.
>>>
>>> Best regards,
>>> Ali
>>>
>>> On Thu, Oct 13, 2016 at 10:22 PM, Joe Witt  wrote:
>>>
 Ali,

 You have a lot of nice resources to work with there.  I'd recommend the
 series of RAID-1 configuration personally provided you keep in mind this
 means you can only lose a single disk for any one partition.  As long as
 they're being monitored and would be quickly replaced this in practice
 works well.  If there could be lapses in monitoring or time to replace then
 it is perhaps safer to go with more redundancy or an alternative RAID type.

 I'd say do the OS, app installs w/user and audit db stuff, application
 logs on one physical RAID volume.  Have a dedicated physical volume for the
 flow file repository.  It will not be able to use all the space but it
 certainly could benefit from having no other contention.  This could be a
 great thing to have SSDs for actually.  And for the remaining volumes split
 them up for content and provenance as you have.  You get to make the
 overall performance versus retention decision.  Frankly, you have a great
 system to work with and I suspect you're going to see excellent results
 anyway.

 Conservatively speaking expect say 50MB/s of throughput per volume in
 the content repository so if you end up with 8 of them could achieve
 upwards of 400MB/s sustained.  You'll also then want to make sure you have
 a good 10G based network setup as well.  Or, you could dial back on the
 speed tradeoff and simply increase retention or disk loss tolerance.  Lots
 of ways to play the game.

 There are no published SSD vs HDD performance benchmarks that I am
 aware of though this is a good idea.  Having a hybrid of SSDs and HDDs
 could offer a really solid performance/retention/cost tradeoff.  For
 example having SSDs for the OS/logs/provenance/flowfile with HDDs for the
 content - that would be quite nice.  At that rate to take full advantage of
 the system you'd need to have very strong network infrastructure between
 NiFi and any systems it is interfacing with  and your flows would need to
 be well tuned for GC/memory efficiency.

 Thanks
 Joe

 On Thu, Oct 13, 2016 at 2:50 AM, Ali Nazemian 
 wrote:

> Dear Nifi Users/ developers,
> Hi,
>
> I was wondering is there any benchmark about the question that is it
> better to dedicate disk control to Nifi or using RAID for this purpose? 
> For
> example, which of these scenarios is recommended from the performance 
> point
> of view?
> Scenario 1:
> 24 disk in total
> 2 disk- raid 1 for OS and fileflow repo
> 2 disk- raid 1 for provenance repo1
> 2 disk- raid 1 for provenance repo2
> 2 disk- raid 1 for content repo1
> 2 disk- raid 1 for content repo2
> 2 disk- raid 1 for content repo3
> 2 disk- raid 1 for content repo4
> 2 disk- 

Re: Rest API Client swagger.json

2016-10-13 Thread Matt Gilman
Thanks for submitting the PR Stephane! I see that Andy has already stated
that he's reviewing. Thanks Andy!

On Thu, Oct 13, 2016 at 7:42 PM, Stéphane Maarek 
wrote:

> Investigated some more, open a JIRA issue, closed it via
> https://github.com/apache/nifi/pull/1135
>
> On Fri, Oct 14, 2016 at 9:47 AM Stéphane Maarek 
> wrote:
>
>> Hi,
>>
>> Thanks it helps ! Good to know there is already a java client I could
>> use. Nonetheless I think it would be extremely nice to use the swagger
>> codegen project to generate additionally sdks, I don't mind creating a
>> github project of my own to maintain these.
>>
>> I gave it a go and it gave me a bunch of errors, see
>> https://github.com/swagger-api/swagger-codegen/issues/3976
>>
>> I went to https://editor.swagger.io/ , uploaded the swagger.json file
>> and apparently the swagger.json specs for many (if not all) delete calls
>> are having wrong specs, see below. Do you think that's worth opening a JIRA?
>>
>>
>> Swagger Error
>>
>> Not a valid parameter definition
>> Jump to line 344
>> Details
>>  Object
>> code: "ONE_OF_MISSING"
>>  params: Array [0]
>> message: "Not a valid parameter definition"
>>  path: Array [5]
>> 0: "paths"
>> 1: "/connections/{id}"
>> 2: "delete"
>> 3: "parameters"
>> 4: "0"
>> schemaId: "http://swagger.io/v2/schema.json#";
>>  inner: Array [2]
>>  0: Object
>> code: "ONE_OF_MISSING"
>>  params: Array [0]
>> message: "Data does not match any schemas from 'oneOf'"
>>  path: Array [5]
>>  inner: Array [2]
>>  0: Object
>> code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>>  params: Array [1]
>> 0: "schema"
>> message: "Missing required property: schema"
>>  path: Array [0]
>>  1: Object
>> code: "ONE_OF_MISSING"
>>  params: Array [0]
>> message: "Data does not match any schemas from 'oneOf'"
>>  path: Array [0]
>>  inner: Array [4]
>>  0: Object
>> code: "ENUM_MISMATCH"
>>  params: Array [1]
>> message: "No enum match for: ref"
>>  path: Array [1]
>>  1: Object
>> code: "ENUM_MISMATCH"
>>  params: Array [1]
>> message: "No enum match for: ref"
>>  path: Array [1]
>>  2: Object
>> code: "ENUM_MISMATCH"
>>  params: Array [1]
>> message: "No enum match for: ref"
>>  path: Array [1]
>>  3: Object
>> code: "ENUM_MISMATCH"
>>  params: Array [1]
>> 0: "ref"
>> message: "No enum match for: ref"
>>  path: Array [1]
>> 0: "type"
>>  1: Object
>> code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>>  params: Array [1]
>> 0: "$ref"
>> message: "Missing required property: $ref"
>>  path: Array [5]
>> 0: "paths"
>> 1: "/connections/{id}"
>> 2: "delete"
>> 3: "parameters"
>> 4: "0"
>> level: 900
>> type: "Swagger Error"
>> description: "Not a valid parameter definition"
>> lineNumber: 344
>>
>> On Thu, Oct 13, 2016 at 11:43 PM Matt Gilman 
>> wrote:
>>
>> Stephane,
>>
>> Yes, you are correct that Apache NiFi uses swagger. However, we are only
>> using it for keeping the documentation in sync. We use a maven plugin that
>> inspects the swagger annotations and generates a swagger.json. The
>> swagger.json is generated to nifi-web-api/target/swagger-ui/swagger.json
>> at build time. Subsequently, the swagger.json is run through a handlebars
>> template to generate the REST API docs.
>>
>> We provide a client library at
>>
>> 
>> org.apache.nifi
>> nifi-client-dto
>> 1.0.0
>> 
>>
>> Examples of its usage can be seen in our access control integration tests
>> [1].
>>
>> Let me know if you have any other questions. Thanks!
>>
>> Matt
>>
>> [1] https://github.com/apache/nifi/blob/master/nifi-nar-
>> bundles/nifi-framework-bundle/nifi-framework/nifi-web/nifi-
>> web-api/src/test/java/org/apache/nifi/integration/accesscontrol/
>> ITProcessorAccessControl.java
>>
>> On Wed, Oct 12, 2016 at 10:53 PM, Stéphane Maarek <
>> stephane.maa...@gmail.com> wrote:
>>
>> Hi,
>>
>> It seems possible to create an API client for any language using this
>> project:
>> https://github.com/swagger-api/swagger-codegen
>>
>> It needs the swagger.json file. I know it should be generated at build
>> time, but where can I find it?
>>
>> Beyond that, would it be useful to extract that file, version control it,
>> and maybe automatically generate API sdks for many languages using the
>> project above? Would help tremendously
>>
>> Cheers
>> Stephane
>>
>>
>>


Re: Rest API Client swagger.json

2016-10-13 Thread Andy LoPresto
Stéphane asked a question on the PR but as it was already closed, I wanted to 
reproduce it here for visibility and to see if other community members had 
something to add:

Stéphane:

good stuff. Quick question: what do you think of NiFi automating the build and 
release of API clients in various languages, vs. me just maintaining them in my own 
repos?

My reply:

I don't think that's something we should do at this time. The NiFi build is 
already long enough (10 - 20 minutes on commodity hardware depending on 
multithreading, tests, and contrib-check) and I don't think many people need 
this functionality, much less in multiple languages, and I don't know how we 
would determine the default languages -- maybe just Java to start as that's 
what the codebase is written in?

If you want to submit a PR for a custom profile that builds the API clients, I 
wouldn't object to that as long as it was disabled by default. Not sure what 
others' feelings are.

Andy LoPresto
alopre...@apache.org
alopresto.apa...@gmail.com
PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69

> On Oct 13, 2016, at 5:13 PM, Matt Gilman  wrote:
> 
> Thanks for submitting the PR Stephane! I see that Andy has already stated 
> that he's reviewing. Thanks Andy!
> 
> On Thu, Oct 13, 2016 at 7:42 PM, Stéphane Maarek  > wrote:
> Investigated some more, open a JIRA issue, closed it via 
> https://github.com/apache/nifi/pull/1135 
> 
> 
> On Fri, Oct 14, 2016 at 9:47 AM Stéphane Maarek  > wrote:
> Hi,
> 
> Thanks it helps ! Good to know there is already a java client I could use. 
> Nonetheless I think it would be extremely nice to use the swagger codegen 
> project to generate additionally sdks, I don't mind creating a github project 
> of my own to maintain these.
> 
> I gave it a go and it gave me a bunch of errors, see  
> https://github.com/swagger-api/swagger-codegen/issues/3976 
> 
> 
> I went to https://editor.swagger.io/ , uploaded the swagger.json file, and 
> apparently the swagger.json specs for many (if not all) DELETE calls are 
> invalid; see below. Do you think that's worth opening a JIRA?
> 
> 
> Swagger Error (lineNumber: 344, level: 900)
> description: "Not a valid parameter definition"
> code: "ONE_OF_MISSING"
> message: "Not a valid parameter definition"
> path: ["paths", "/connections/{id}", "delete", "parameters", "0"]
> schemaId: "http://swagger.io/v2/schema.json#"
> inner:
>   - code: "ONE_OF_MISSING"
>     message: "Data does not match any schemas from 'oneOf'"
>     inner:
>       - code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>         params: ["schema"]
>         message: "Missing required property: schema"
>       - code: "ONE_OF_MISSING"
>         message: "Data does not match any schemas from 'oneOf'"
>         inner: (four identical entries)
>           - code: "ENUM_MISMATCH"
>             params: ["ref"]
>             message: "No enum match for: ref"
>             path: ["type"]
>   - code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>     params: ["$ref"]
>     message: "Missing required property: $ref"
>     path: ["paths", "/connections/{id}", "delete", "parameters", "0"]
> 
> On Thu, Oct 13, 2016 at 11:43 PM Matt Gilman  > wrote:
> Stephane,
> 
> Yes, you are correct that Apache NiFi uses swagger. However, we are only 
> using it for keeping the documentation in sync. We use a maven plugin that 
> inspects the swagger annotations and generates a swagger.json. The 
> swagger.json is generated to nifi-web-api/target/swagger-ui/swagger.json at 
> build time. Subsequently, the swagger.json is run through a handlebars 
> template to generate the REST API docs.
> 
> We provide a client library at
> <dependency>
>     <groupId>org.apache.nifi</groupId>
>     <artifactId>nifi-client-dto</artifactId>
>     <version>1.0.0</version>
> </dependency>
> Examples of its usage can be seen in our access control integration tests [1].
> 
> Let me know if you have any other questions. Thanks!
> 
> Matt
> 
> [1] 
> https://github.com/apache/nifi/blob/master/ni

Re: Push x Pull ETL

2016-10-13 Thread Márcio Faria
Jeff,
Many thanks. I'm now more confident NiFi could be a good fit for us.
Marcio
 

On Wednesday, October 12, 2016 9:06 PM, Jeff  wrote:
 

 Hello Marcio,
You're asking on the right list!
Based on the scenario you described, I think NiFi would suit your needs.  To 
address your 3 major steps of your workflow:
1) Processors can run based on a timer-based or cron-based schedule.  
GenerateTableFetch is a processor that can be used to create SQL SELECT 
statements from a table based on increasing values in one or more columns, and 
can be partitioned depending on your batching needs.  These SQL SELECT 
statements can then be executed against the destination database by use of the 
PutSQL processor.
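
If it helps to see the pattern those two processors automate, here is a minimal 
plain-JDBC sketch (this is not NiFi code; the connection URLs, table, and column 
names are invented) of fetching only rows whose id is greater than the last value 
seen and writing them to the destination:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IncrementalCopy {
    public static void main(String[] args) throws Exception {
        long lastMaxId = 0L; // NiFi tracks this "maximum value" state for you

        try (Connection src = DriverManager.getConnection("jdbc:postgresql://source/db", "user", "pass");
             Connection dst = DriverManager.getConnection("jdbc:postgresql://dest/db", "user", "pass")) {

            // The kind of statement GenerateTableFetch builds: only rows newer than the last max value
            PreparedStatement select = src.prepareStatement(
                    "SELECT id, payload FROM orders WHERE id > ? ORDER BY id");
            select.setLong(1, lastMaxId);

            PreparedStatement insert = dst.prepareStatement(
                    "INSERT INTO orders (id, payload) VALUES (?, ?)");

            try (ResultSet rows = select.executeQuery()) {
                while (rows.next()) {
                    insert.setLong(1, rows.getLong("id"));
                    insert.setString(2, rows.getString("payload"));
                    insert.executeUpdate();
                    lastMaxId = Math.max(lastMaxId, rows.getLong("id"));
                }
            }
        }
    }
}
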
2) With the more recent data, which I'm assuming is queried from the 
destination database, you can use QueryDatabaseTable to retrieve the new rows 
in Avro format and then transform as needed, which may include processors that 
encapsulate any custom logic you might have written for your homemade ETL 
solution.
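
To give a feel for what "transform as needed" can look like in plain code, here 
is a small sketch of reading those Avro rows (assuming the Apache Avro Java 
library is on the classpath; the record field names are invented):

import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroRows {
    public static void main(String[] args) throws Exception {
        // Open an Avro container file like the one QueryDatabaseTable emits
        File avroFile = new File("rows.avro");
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(avroFile, new GenericDatumReader<>())) {
            for (GenericRecord record : reader) {
                // Each record is one database row; adjust field names to your schema
                System.out.println(record.get("id") + " -> " + record.get("payload"));
            }
        }
    }
}
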
3) The PostHTTP processor can be used to send files over HTTPS to the external 
server.
Processors have failure relationships; when processing of a flow file fails, it 
can be routed as appropriate, such as retrying failed flow files.  For 
errors that require human intervention, there are a number of options.  Most 
likely, the way your homemade solution currently handles errors that require 
human intervention can be done by NiFi as well.
Personally, I have used NiFi in similar ways to what you have described.  There 
are some examples on the Apache NiFi site [1] that you can check out.  Your 
question about stopping and restarting processing when errors occur is 
addressable, though much of that comes down to how you design your flow.
Feel free to ask any questions!  Much of the information above is fairly 
high-level, and NiFi offers a lot of processors to meet your data flow needs.
- Jeff 
On Tue, Oct 11, 2016 at 5:18 PM Márcio Faria  wrote:

Hi,

Potential NiFi user here.

I'm trying to figure out if NiFi could be a good choice to replace our existing 
homemade ETL system, which roughly works like this:

1) Either on demand or at periodic instants, fetch fresh rows from one or more 
tables in the source database and insert or update them into the destination 
database;

2) Run the jobs which depend on the more recent data, and generate files based 
on those;

3) Upload the generated files to an external server using HTTPS.

Since our use cases are more of a "pull" style (Ex: It's time to run the report 
-> get the required data updated -> run the processing job and submit the 
results) than "push" (Ex: Get the latest data available -> when some condition 
is met, run the processing job and submit the results), I'm wondering if NiFi, 
or any other flow-based toolset for that matter, would be a good option for us 
to try or not. Your opinion? Suggestions?

Besides, what is the recommended way to handle errors in an ETL scenario like 
that? For example, we submit a "page" of rows to a remote server and its 
response tells us which of those rows were accepted and which ones had a 
validation error. What would be the recommended approach to handle such errors 
if the fix requires some human intervention? Is there a way of stopping the 
whole flow until the correction is done? How to restart it when part of the 
data were already processed by some of the processors? The server won't accept 
a transaction B if it depends on a transaction A that wasn't successfully 
submitted before.

As you see, our processing is very batch-oriented. I know NiFi can fetch data 
in chunks from a relational database, but I'm not sure how to approach the 
conversion from our current style to a more "stream"-oriented one. I'm afraid I 
could try to use the "right tool for the wrong problem", if you know what I 
mean.
Apologies if this is not the proper venue to ask. I checked all the posts in 
this mailing list and also tried to search for information elsewhere, but I 
wasn't able to find the answers myself. 

Any guidance, like examples or links to further reading, would be very much 
appreciated. I'm just starting to learn the ropes.

Thank you,
Marcio



   

Re: Push x Pull ETL

2016-10-13 Thread Jeff
Great to hear, Marcio!



Re: Rest API Client swagger.json

2016-10-13 Thread Stéphane Maarek
Thanks,

FYI, I've started to host my own swagger-codegen-generated Java client on
my GitHub: https://github.com/simplesteph/nifi-api-client-java. Check out
the docs!
If you want to start playing and get a feel for it:


// Note: package/import names depend on how the client was generated
// (swagger-codegen defaults to io.swagger.client.*), so adjust to your build.
public static void main(String[] args) {

    // Point the generated client at a local, unsecured NiFi instance
    ApiClient apiClient = new ApiClient();
    apiClient.setBasePath("http://localhost:8080/nifi-api");

    FlowApi flowApiInstance = new FlowApi(apiClient);
    try {
        // Ask NiFi for a client id; it is used for revision tracking on later requests
        String result = flowApiInstance.generateClientId();
        System.out.println(result);
    } catch (ApiException e) {
        e.printStackTrace();
    }
}

Cheers,
Stephane

On Fri, Oct 14, 2016 at 11:27 AM Andy LoPresto  wrote:

> Stéphane asked a question on the PR but as it was already closed, I wanted
> to reproduce it here for visibility and to see if other community members
> had something to add:
>
> Stéphane:
>
> good stuff. Quick question: what do you think of NiFi automating the build
> and release of API clients in various languages, vs. me just maintaining
> them in my own repos?
>
> My reply:
>
> I don't think that's something we should do at this time. The NiFi build
> is already long enough (10 - 20 minutes on commodity hardware depending on
> multithreading, tests, and contrib-check) and I don't think many people
> need this functionality, much less in multiple languages, and I don't know
> how we would determine the default languages -- maybe just Java to start as
> that's what the codebase is written in?
> If you want to submit a PR for a custom profile that builds the API
> clients, I wouldn't object to that as long as it was disabled by default.
> Not sure what others' feelings are.
>
> Andy LoPresto
> alopre...@apache.org
> alopresto.apa...@gmail.com
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
> On Oct 13, 2016, at 5:13 PM, Matt Gilman  wrote:
>
> Thanks for submitting the PR Stephane! I see that Andy has already stated
> that he's reviewing. Thanks Andy!
>
> On Thu, Oct 13, 2016 at 7:42 PM, Stéphane Maarek <
> stephane.maa...@gmail.com> wrote:
>
> Investigated some more, opened a JIRA issue, and closed it via
> https://github.com/apache/nifi/pull/1135
>
> On Fri, Oct 14, 2016 at 9:47 AM Stéphane Maarek 
> wrote:
>
> Hi,
>
> Thanks, it helps! Good to know there is already a Java client I could use.
> Nonetheless, I think it would be extremely nice to use the swagger-codegen
> project to generate additional SDKs; I don't mind creating a GitHub
> project of my own to maintain these.
>
> I gave it a go and it gave me a bunch of errors, see
> https://github.com/swagger-api/swagger-codegen/issues/3976
>
> I went to https://editor.swagger.io/ , uploaded the swagger.json file, and
> apparently the swagger.json specs for many (if not all) DELETE calls are
> invalid; see below. Do you think that's worth opening a JIRA?
>
>
> Swagger Error (lineNumber: 344, level: 900)
> description: "Not a valid parameter definition"
> code: "ONE_OF_MISSING"
> message: "Not a valid parameter definition"
> path: ["paths", "/connections/{id}", "delete", "parameters", "0"]
> schemaId: "http://swagger.io/v2/schema.json#"
> inner:
>   - code: "ONE_OF_MISSING"
>     message: "Data does not match any schemas from 'oneOf'"
>     inner:
>       - code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>         params: ["schema"]
>         message: "Missing required property: schema"
>       - code: "ONE_OF_MISSING"
>         message: "Data does not match any schemas from 'oneOf'"
>         inner: (four identical entries)
>           - code: "ENUM_MISMATCH"
>             params: ["ref"]
>             message: "No enum match for: ref"
>             path: ["type"]
>   - code: "OBJECT_MISSING_REQUIRED_PROPERTY"
>     params: ["$ref"]
>     message: "Missing required property: $ref"
>     path: ["paths", "/connections/{id}", "delete", "parameters", "0"]
>
>