Thanks Joe.

On Sun, May 1, 2016 at 2:55 PM, Joe Witt <joe.w...@gmail.com> wrote:
> Igor,
>
> There is no automatic failover of the node that is considered primary.
> For the upcoming 1.x release, though, this has been addressed:
> https://issues.apache.org/jira/browse/NIFI-483
>
> Thanks
> Joe
>
> On Sun, May 1, 2016 at 2:36 PM, Igor Kravzov <igork.ine...@gmail.com> wrote:
> > Thanks Aldrin for the response.
> > What I didn't fully understand from the documentation: is automatic
> > failover implemented? I would rather configure the entire workflow to
> > run "On primary node".
> >
> > On Sun, May 1, 2016 at 1:31 PM, Aldrin Piri <aldrinp...@gmail.com> wrote:
> >> Igor,
> >>
> >> Your thoughts are correct: without any additional configuration, the
> >> GetTwitter processor would run on both nodes. The way to avoid this is
> >> to select the "On primary node" scheduling strategy, which has the
> >> processor run only on whichever node is currently primary.
> >>
> >> PutHDFS has similar semantics, but there they are likely desired.
> >> Consider the case where data is partitioned across the nodes: PutHDFS
> >> would then need to run on each node to ensure all the data is delivered
> >> to HDFS. The property you list specifies where the data should land on
> >> the configured HDFS instance. Often this is set via Expression Language
> >> (EL) to get the familiar time slicing of resources when persisted, such
> >> as ${now():format('yyyy/MM/dd/HH')}. You could additionally have a
> >> directory structure that mirrors the data, making use of attributes the
> >> files may have gained as they made their way through your flow, or use
> >> an UpdateAttribute processor to set a property, such as
> >> "hadoop.dest.dir", that the final PutHDFS references to give a dynamic
> >> location on a per-FlowFile basis.
> >>
> >> Let us know if you have additional questions or if anything is unclear.
> >>
> >> --aldrin
> >>
> >> On Sun, May 1, 2016 at 1:20 PM, Igor Kravzov <igork.ine...@gmail.com> wrote:
> >>> If I understand correctly, in cluster mode the same dataflow runs on
> >>> all the nodes.
> >>> So let's say I have a simple dataflow with GetTwitter and PutHDFS
> >>> processors, and one NCM + 2 nodes.
> >>> Does that actually mean GetTwitter will be called independently, and
> >>> potentially simultaneously, on each node, so there may be duplicate
> >>> results?
> >>> How about the PutHDFS processor? To where should "Hadoop Configuration
> >>> Resources" and the parent HDFS directory point on each node?
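[Editor's note] As a sketch of what Aldrin's time-slicing expression does: the NiFi EL ${now():format('yyyy/MM/dd/HH')} evaluates to a date/hour path segment at the moment each FlowFile is written, so files land in hourly buckets under the configured directory. The Python below only mirrors that behavior for illustration; the function name and base path are made up, not part of NiFi.

```python
from datetime import datetime

def time_sliced_dir(base, now=None):
    """Mirror of the NiFi EL ${now():format('yyyy/MM/dd/HH')}:
    append a yyyy/MM/dd/HH segment to a base HDFS directory."""
    now = now or datetime.now()
    return base + "/" + now.strftime("%Y/%m/%d/%H")

# Using the timestamp of this thread (May 1, 2016, 2:55 PM):
print(time_sliced_dir("/data/tweets", datetime(2016, 5, 1, 14, 55)))
# → /data/tweets/2016/05/01/14
```

Because every node evaluates the expression independently, all nodes writing in the same hour converge on the same HDFS directory, which is why running PutHDFS on each node is safe here.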