Re: Filtering GetTwitter with both terms and locations

2015-10-21 Thread Bryan Bende
From looking at the processor code, it adds both the terms and
locations to the filter endpoint, so it should be able to filter on both. The
processor leverages the Hosebird Client [1], so it could be possible that
library is not working as expected.

Is there a specific example of terms that aren't working? Or do they never
work in conjunction with locations?

[1] https://github.com/twitter/hbc
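For anyone curious, the relevant part of the hbc API looks roughly like this
(a simplified sketch, not the exact GetTwitter code; the terms and coordinates
are just example values):

import java.util.Arrays;

import com.twitter.hbc.core.endpoint.Location;
import com.twitter.hbc.core.endpoint.StatusesFilterEndpoint;

public class FilterEndpointSketch {
    public static void main(String[] args) {
        // terms and locations are both added as parameters on the same filter request
        StatusesFilterEndpoint endpoint = new StatusesFilterEndpoint();
        endpoint.trackTerms(Arrays.asList("term1", "term2"));
        endpoint.locations(Arrays.asList(new Location(
                new Location.Coordinate(-122.75, 36.8),     // south-west corner (lon, lat)
                new Location.Coordinate(-121.75, 37.8))));  // north-east corner (lon, lat)
    }
}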

On Wed, Oct 21, 2015 at 3:03 PM, David Klim  wrote:

> Hello,
>
> I am trying to get data from the Twitter filter endpoint using both a location
> (bounding box) and terms to filter on. The data I get is not being
> filtered by terms at all. Is there any known problem with the feature? Not
> sure if the processor behaves as I expect.
>
> Thanks a lot!
>
>
>


Re: how to putkafka to multiple partions

2015-10-21 Thread Bryan Bende
Hello,

This might be related to:
https://issues.apache.org/jira/browse/NIFI-1020

Does anyone know if the upgrade of the Kafka client for the next NiFi
release addresses this issue?
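As background on why the key matters: a keyed Kafka producer typically picks
the partition from a hash of the key modulo the number of partitions, roughly
like the illustration below (an illustration of the general idea only, not the
actual Kafka or NiFi code). So distinct keys like ${uuid()} should spread
across partitions, which is part of why this looks like it may be a
client-side issue:

public class KeyPartitionIllustration {

    // Illustration of the general idea only, not the actual Kafka partitioner code.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"a", "b", "c", "d", "e", "f"}) {
            System.out.println(key + " -> partition " + partitionFor(key, 6));
        }
    }
}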

-Bryan


On Mon, Oct 19, 2015 at 4:01 AM, 彭光裕  wrote:

> hi,
>
>
>
>  I have a test topic in Kafka with 6 partitions. When I use the default
> PutKafka of NiFi to put messages (let’s say 100/batch), I found it usually
> only puts to one partition.
>
> I have tried setting the Kafka key property to ‘${nextInt():mod(6)}’ or
> ‘${uuid()}’, but it doesn’t seem to work. How do I put the messages to the test
> topic with something like a round-robin policy?
>
>
>
> Any advice will be welcome. Thank you.
>
>
>
>
>
> Roland.
>
>
>
>
>


Re: StoreInKiteDataset cannot recognize hive dataset

2015-11-09 Thread Bryan Bende
Hello,

I'm not that familiar with Kite, but is it possible that you need to create
the Kite dataset using the Kite CLI before StoreInKiteDataset tries to
write data to it?

It looks like that is how the test cases for this processor work:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-kite-bundle/nifi-kite-processors/src/test/java/org/apache/nifi/processors/kite/TestKiteProcessorsCluster.java#L85

It uses "dataset:hive:ns/test", but calls Datasets.create(...) before
running the processor.
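If it helps, creating the dataset ahead of time with the Kite API looks roughly
like this (a sketch based on the Kite SDK docs; the schema and dataset name are
just placeholders, and the Kite CLI's dataset create command should work too):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.kitesdk.data.DatasetDescriptor;
import org.kitesdk.data.Datasets;

public class CreateHiveDataset {
    public static void main(String[] args) {
        // a throwaway Avro schema, just for illustration
        Schema schema = SchemaBuilder.record("Sandwich").fields()
                .requiredString("name")
                .endRecord();

        // create the Hive-backed dataset up front so StoreInKiteDataset has
        // something to write into
        Datasets.create("dataset:hive:default/sandwiches",
                new DatasetDescriptor.Builder().schema(schema).build());
    }
}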

-Bryan


On Mon, Nov 9, 2015 at 5:06 AM, panfei  wrote:

> Hi all:
> I use StoreInKiteDataset processor to try to store data in hive by
> configuring the target URI to:
>
> *dataset:hive:default/sandwiches*
>
> but the processor reports that *the URI is invalid*. But after replacing
> the URI with
>
> *dataset:file:/tmp/sandwiches*
>
> everything works OK.
>
>
> Is there any way to resolve the Hive dataset issue? Or is it not
> supported at all?
>
>
> Thank you very much
> --
> 不学习,不知道
>


Re: Managing flows

2015-11-11 Thread Bryan Bende
In addition to what Mark said, there is also the option of templates [1].
Templates let you export a portion, or all, of your flow
and then import it again later. When you export a template it will not
export any properties that are marked as sensitive,
so it is safe to share with others.

Regarding "one flow", you can have as many different logical flows with in
one nifi instance as you want, but it is managed as one flow behind the
scenes.

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates

On Wed, Nov 11, 2015 at 10:00 AM, Darren Govoni  wrote:

> Mark,
>Thanks for the tips. Appreciate it.
>
> So when I run nifi on a single server. It is essentially "one flow"?
> If I wanted to have say 2 or 3 active flows, I would (reasonably) have to
> run more instances of nifi with appropriate
> configuration to not conflict. Is that right?
>
> Darren
>
>
> On 11/11/2015 09:54 AM, Mark Petronic wrote:
>
>> Look in your Nifi conf directory. The active flow is there as an aptly
>> named .gz file. Guessing you could just rename that and restart Nifi
>> which would create a blank new one. Build up another flow, then you
>> could repeat the same "copy to new file name" and restore some other
>> one to continue on some previous flow. I'm pretty new to NiFi, too,
>> so maybe there is another way. Also, you can create point-in-time
>> backups of your flow from the "Settings" dialog in the DFM. There is a
>> link that shows up in there to click. It will copy your master flow gz
>> to your conf/archive directory. You can create multiple snapshots of
>> your flow to retain change history. I actually gunzip my backups and
>> commit them to Git for a more formal change history tracking
>> mechanism.
>>
>> Hope that helps.
>>
>> On Wed, Nov 11, 2015 at 9:45 AM, Darren Govoni 
>> wrote:
>>
>>> Hi again,
>>> Sorry for the noob questions. I am reading all the online material as
>>> much as possible.
>>> But what hasn't jumped out at me yet is how flows are managed?
>>>
>>> Are they saved, loaded, etc? I access my nifi and build a flow. Now I
>>> want
>>> to save it and work on another flow.
>>> Lastly, will the flow be running even if I exit the webapp?
>>>
>>> thanks for any tips. If I missed something obvious, regrets.
>>>
>>> D
>>>
>>
>


Re: Post to REST service?

2015-11-11 Thread Bryan Bende
Hello,

You should be able to use expression language in the URL value; you can
reference any attribute by doing the following: ${attributeName}. So your
URL could be http://myhost/${id}

-Bryan

On Wed, Nov 11, 2015 at 8:03 AM, Darren Govoni  wrote:

> Hi,
>   I am trying to get my PostHTTP processor to post the incoming content to
> a REST url.
> The incoming flowfile has an attribute set 'id' that needs to be part of
> the URL of the POST.
>
> Is there a notation for parameterizing the post URL from flowfile
> attributes?
>
> thanks,
> Darren
>


Re: Why does PutFile create directories for you but PutHDFS does not?

2015-11-12 Thread Bryan Bende
I think PutHDFS always creates them so it isn't an option through the
properties.

-Bryan

On Thu, Nov 12, 2015 at 8:19 AM, Mark Payne  wrote:

> Mark,
>
> My guess is that it was an oversight. I don't believe it was intentional
> to leave it out of PutHDFS.
>
> Thanks
> -Mark
>
>
> > On Nov 12, 2015, at 1:57 AM, Mark Petronic 
> wrote:
> >
> > Just wondering about the history behind why one has the logic to
> > create them but the other does not?
>
>


Re: template repo for learning

2015-11-03 Thread Bryan Bende
If you can share a little more info about what the API that you're trying
to interact with looks like, we can likely provide more concrete guidance.

As a very basic test, to familiarize yourself with the Expression Language,
you could create a "dummy flow" such as:
- GenerateFlowFile to create a new FlowFile on a timer
- UpdateAttribute to set attributes you want to be passed to your API, you
can use expression language here to create dynamic date/time values
- InvokeHttp to call your API
- You could then route the "Response" relationship from InvokeHttp to some
other processor to possibly extract information from the response for
further use

Let us know if we can help more.
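As a concrete example of the UpdateAttribute step, you could add dynamic
properties along these lines (the attribute names and URL here are made up) and
then reference them in InvokeHttp's Remote URL:

query.date = ${now():format('yyyy-MM-dd')}
query.start = ${now():toNumber():minus(3600000):format("yyyy-MM-dd'T'HH:mm:ss")}

Remote URL = http://example.com/api/records?date=${query.date}&since=${query.start}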

On Tue, Nov 3, 2015 at 9:51 AM, Christopher Hamm <ceham...@gmail.com> wrote:

> I am trying to query api based on date/time and possibly based on results
> from fields of another query.
> On Nov 3, 2015 9:41 AM, "Bryan Bende" <bbe...@gmail.com> wrote:
>
>> Christopher,
>>
>> In terms of templates, the best resources we have right now are:
>>
>> https://cwiki.apache.org/confluence/display/NIFI/Example+Dataflow+Templates
>> https://github.com/xmlking/nifi-examples
>>
>> For expression language we have the EL guide:
>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
>>
>> Is there a specific flow you are trying to tackle?
>>
>> -Bryan
>>
>>
>> On Tue, Nov 3, 2015 at 9:36 AM, Christopher Hamm <
>> em...@christopherhamm.com> wrote:
>>
>>> Is there a repo of nifi templates with advanced features that use lots
>>> of expression language, especially when used to make requests? I can't find
>>> enough docs or youtube videos that really dig into it.
>>>
>>
>>


Re: Filtering GetTwitter with both terms and locations

2015-11-01 Thread Bryan Bende
Hi David,

After re-reading the Twitter API documentation [1], it says:

"The track, follow, and locations fields should be considered to be
combined with an OR operator. track=foo&follow=1234 returns Tweets matching
“foo” OR created by user 1234."

-Bryan
[1] https://dev.twitter.com/streaming/reference/post/statuses/filter


On Sun, Nov 1, 2015 at 5:08 PM, Juan Jose Escobar <
juanjose.esco...@gmail.com> wrote:

> Hello, David
>
> In a previous post you mentioned you are filtering by location and terms.
> Keep in mind the way the filtering works in the GetTwitter processor: in
> your case it will return all tweets that are associated with the specified
> bounding box OR match any of the terms. Do not expect the output of the
> processor to contain tweets matching both conditions. You will need to
> implement one of the conditions separately to do so. Unless the frequency
> of the terms is high, I would say the best approach is to filter only by
> terms, and then add additional filtering for the location using additional
> processors. There are many options here, e.g. you could use
> EvaluateJSonPath (write to attribute), then RouteOnAttribute.
>
> Hope this helps
>
> On Sun, Nov 1, 2015 at 10:47 PM, David Klim 
> wrote:
>
>>
>> I have tested this again with different terms combinations but it seems
>> it's ignoring the filtering. Any ideas on how to fix it?
>>
>> Thanks in advance!
>>
>> --
>> From: davidkl...@hotmail.com
>> To: users@nifi.apache.org
>> Subject: RE: Filtering GetTwitter with both terms and locations
>> Date: Tue, 27 Oct 2015 23:46:00 +0100
>>
>> Hello,
>>
>> As far as I can see from my testing, terms seem to be ignored.
>>
>> Here is the twitter processor configuration:
>>
>> ---
>> twitter endpoint: filter endpoint
>> terms to filter on: kmeans,kmean
>> locations to filter on: -124.476284,32.172870,-59.437227,48.862583
>> ---
>>
>>
>> --
>> Date: Wed, 21 Oct 2015 16:09:02 -0400
>> Subject: Re: Filtering GetTwitter with both terms and locations
>> From: bbe...@gmail.com
>> To: users@nifi.apache.org
>>
>> From looking at the processor code it looks like it adds both the terms
>> and locations to the filter endpoint and should be able to filter on both.
>> The processor leverages the Hosebird Client [1] so it could be possible
>> that library is not working as expected.
>>
>> Is there a specific example of terms that aren't working? or they never
>> work in conjunction with locations?
>>
>> [1] https://github.com/twitter/hbc
>>
>> On Wed, Oct 21, 2015 at 3:03 PM, David Klim 
>> wrote:
>>
>> Hello,
>>
>> I am trying to get data from the Twitter filter endpoint using both a
>> location (bounding box) and terms to filter on. The data I get is not
>> being filtered by terms at all. Is there any known problem with the
>> feature? Not sure if the processor behaves as I expect.
>>
>> Thanks a lot!
>>
>>
>>
>>
>


Re: JSON / Avro issues

2015-11-05 Thread Bryan Bende
Jeff,

Are you using the 0.3.0 release?

I think this is the issue you ran into which is resolved for the next
release:
https://issues.apache.org/jira/browse/NIFI-944

With regards to ConvertJSONtoAvro, I believe it expects one JSON document per
line, with a newline at the end of each line (your second example).
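In other words, the expected input is along these lines (just a made-up sample):

{"id": 1, "name": "first record"}
{"id": 2, "name": "second record"}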

-Bryan

On Thu, Nov 5, 2015 at 4:59 PM, Jeff  wrote:

> I built a simple flow that reads a tab separated file and attempts to
> convert to Avro.
>
> ConvertCSVtoAvro just says that the conversion failed.
>
> Where can I find more information on what the failure was?
>
> Using the same sample tab separated file, I create a JSON file out of it.
>
> The JSON to Avro processor also fails with very little explanation.
>
>
> With regard to the ConvertCSVtoAvro processor
> Since my file is tab delimited, do I simply open the "CSV
> delimiter" property, delete the ",", and hit the tab key, or is there a special
> syntax like ^t?
> My data has no CSV quote character, so do I leave this as is,
> delete it, or check the empty box?
>
> With regard to the ConvertJSONtoAvro
> What is the expected JSON source file to look like?
> [
>  {fields values … },
>  {fields values …}
> ]
> Or
>  {fields values … }
>  {fields values …}
> or something else.
>
> Thanks,
>
> Sorry for sending this to 2 lists


Re: Referencing Source File Attributes with MergeContent Processor

2015-10-14 Thread Bryan Bende
I'm not sure if this makes sense without seeing the full flow, but can you
construct the JSON for each image before MergeContent by using ReplaceText?
This way the id will already be taken care of before merging.


On Wednesday, October 14, 2015, Chris Mangold  wrote:

> I have use case where I am trying to merge content using the NIFI
> MergeContent processor.
>
> Scenario is that I am retrieving image files from a server, base64
> encoding them and then batching them together into a JSON format using
> MergeContent. I am using the Header, Footer and Demarcator fields to
> construct a JSON object array:
>
> [...
>
> {
>
> "id": 
>
> "base64": from merged flow file
>
> }
>
> ..
>
> ]
>
> Using the NiFi expression language I was hoping to reference an
> attribute on the incoming files to populate the "id" field, and have it
> unique. All works well except my "id" values are the same for all the merged
> files. By using the expression language it appears the attribute value from
> only one of the files is being evaluated.
>
> Example of my expression language:
> "},{"recordid":${RecordID:trim()},"base64Image":
>
>
> Any advice would be helpful.
>
> Thanks in advance.
>
>
> Chris
>
>
> --
> Chris Mangold
> 301-471-5758 (c)
> 301-898-7979 (h)
>


-- 
Sent from Gmail Mobile


Re: Need help in nifi- flume processor

2015-10-08 Thread Bryan Bende
Hi Parul,

It is possible to deploy a custom Flume source/sink to NiFi, but due to the
way the Flume processors load the classes for the sources and sinks, the
jar you deploy to the lib directory also needs to include the other
dependencies your source/sink needs (or they each need to individually be
in lib/ directly).

So here is a sample project I created that makes a shaded jar:
https://github.com/bbende/my-flume-source

It will contain the custom source and following dependencies all in one jar:

org.apache.flume:my-flume-source:jar:1.0-SNAPSHOT
+- org.apache.flume:flume-ng-sdk:jar:1.6.0:compile
+- org.apache.flume:flume-ng-core:jar:1.6.0:compile
+- org.apache.flume:flume-ng-configuration:jar:1.6.0:compile
+- org.apache.flume:flume-ng-auth:jar:1.6.0:compile
   \- com.google.guava:guava:jar:11.0.2:compile
      \- com.google.code.findbugs:jsr305:jar:1.3.9:compile

I copied that to NiFi lib, restarted, created an ExecuteFlumeSource
processor with the following config:

Source Type = org.apache.flume.MySource
Agent Name = a1
Source Name = r1
Flume Configuration = a1.sources = r1

And I was getting the output in nifi/logs/nifi-bootstrap.log
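For anyone who wants to see what the source side looks like, a minimal
PollableSource is roughly the following (a simplified sketch against the Flume
1.6.0 interfaces, not the exact code in the sample project):

package org.apache.flume;  // matches the "org.apache.flume.MySource" Source Type above

import java.nio.charset.StandardCharsets;

import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// A trivial source that emits a fixed event each time NiFi polls it.
public class MySource extends AbstractSource implements Configurable, PollableSource {

    @Override
    public void configure(Context context) {
        // read any custom properties from the agent configuration here
    }

    @Override
    public Status process() throws EventDeliveryException {
        getChannelProcessor().processEvent(
                EventBuilder.withBody("hello from MySource", StandardCharsets.UTF_8));
        return Status.READY;
    }
}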

Keep in mind that this could become risky because any classes found in the
lib directory would be accessible to all NARs in NiFi and would be found
before classes within a NAR because the parent is checked first during
class loading. This example isn't too risky because we are only bringing in
flume jars and one guava jar, but, for example, if another NAR uses a
different version of guava this is going to cause a problem.

If you plan to use NiFi for the long term, it might be worth investing in
converting your custom Flume components to NiFi processors. We can help you
get started if you need any guidance going that route.

-Bryan


On Thu, Oct 8, 2015 at 2:30 AM, Parul Agrawal <parulagrawa...@gmail.com>
wrote:

> Hello Bryan,
>
> Thank you very much for your response.
>
> Is it possible to have a customized Flume source and sink in NiFi?
> I have my own customized source and sink. I followed the steps below to add my
> own customized source, but it did not work.
>
> 1) Created Maven project and added customized source. (flume.jar was
> created after this step)
> 2) Added flume.jar file to nifi-0.3.0/lib folder.
> 3) Added flume source processor with the below configuration
>
> Property   Value
> Source Type com.flume.source.Source
> Agent Name  a1
> Source Name k1.
>
> But I am getting the below error in Flume Source Processor.
> "Failed to run validation due to java.lang.NoClassDefFoundError :
> /org/apache/flume/PollableSource."
>
> Can you please help me in this regard. Any step/configuration I missed.
>
> Thanks and Regards,
> Parul
>
>
> On Wed, Oct 7, 2015 at 6:57 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hello,
>>
>> The NiFi Flume processors are for running Flume sources and sinks within
>> NiFi. They don't communicate with an external Flume process.
>>
>> In your example you would need an ExecuteFlumeSource configured to run
>> the netcat source, connected to a ExecuteFlumeSink configured with the
>> logger.
>>
>> -Bryan
>>
>> On Wednesday, October 7, 2015, Parul Agrawal <parulagrawa...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I was trying to run the NiFi Flume processor with the below mentioned
>>> details but could not bring it up.
>>>
>>> I already started flume with the sample configuration file
>>> =
>>> # example.conf: A single-node Flume configuration
>>>
>>> # Name the components on this agent
>>> a1.sources = r1
>>> a1.sinks = k1
>>> a1.channels = c1
>>>
>>> # Describe/configure the source
>>> a1.sources.r1.type = netcat
>>> a1.sources.r1.bind = localhost
>>> a1.sources.r1.port = 4
>>>
>>> # Describe the sink
>>> a1.sinks.k1.type = logger
>>>
>>> # Use a channel which buffers events in memory
>>> a1.channels.c1.type = memory
>>> a1.channels.c1.capacity = 1000
>>> a1.channels.c1.transactionCapacity = 100
>>>
>>> # Bind the source and sink to the channel
>>> a1.sources.r1.channels = c1
>>> a1.sinks.k1.channel = c1
>>> =
>>>
>>> Command used to start flume : $ bin/flume-ng agent --conf conf
>>> --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
>>>
>>> In the Nifi browser of ExecuteFlumeSink following configuration was done:
>>> Property   Value
>>> Sink Type logger
>>> Agent Name  a1
>>> Sink Name k1.
>>>
>>> Event is sent to the flume using:
>>> $ telnet localhost 4
>>> Trying 127.0.0.1...
>>> Connected to localhost.localdomain (127.0.0.1).
>>> Escape character is '^]'.
>>> Hello world! 
>>> OK
>>>
>>> But I could not get any data in the nifi flume processor. Request your
>>> help in this.
>>> Do i need to change the example.conf file of flume so that Nifi Flume
>>> Sink should get the data.
>>>
>>> Thanks and Regards,
>>> Parul
>>>
>>
>>
>> --
>> Sent from Gmail Mobile
>>
>
>


Re: Nifi & Spark receiver performance configuration

2015-10-08 Thread Bryan Bende
Hello,

When you say you were unhappy with the performance, can you give some more
information about what was not performing well?

Was the NiFi Spark Receiver not pulling messages in fast enough and they
were queuing up in NiFi?
Was NiFi not producing messages as fast as you expected?
What kind of environment were you running this in? All on a local machine for
testing?

-Bryan

On Thu, Oct 8, 2015 at 6:52 AM, Aurélien DEHAY 
wrote:

> Hello.
>
>
>
> I’m doing some experimentation on Apache NiFi to see where we can use it.
>
>
>
> One idea is to use NiFi to feed a Spark cluster. So I’m doing a simple
> test (GenerateFlowFile => Spark output port) and a simple word count on the
> Spark side.
>
>
>
> I was pretty unhappy with the performance out of the box, so I looked on
> the net and found almost nothing.
>
>
>
> So I looked at nifi.properties, and found that some of the following
> properties have a huge impact on how many messages / second were processed
> to Spark :
>
>
>
> nifi.queue.swap.threshold=2
>
> nifi.swap.in.period=1 sec
>
> nifi.swap.in.threads=1
>
> nifi.swap.out.period=1 sec
>
> nifi.swap.out.threads=4
>
>
>
> The documentation seems unclear on this point for output ports, does anyone
> have a pointer for me?
>
>
>
> Thanks.
>
>
>
> Aurélien.
>


Re: output port

2015-10-19 Thread Bryan Bende
Hello,

Just to clarify, so you are seeing the messages reach the output port and
then get removed from the queue? And on the Spark side, does the NiFi Spark
receiver never receive anything? Or does it receive messages, but they have no
content?

-Bryan

On Monday, October 19, 2015, Rama Krishna Manne 
wrote:

> I have a flow in which messages are emitted to an output port and Apache Spark
> will pull the messages from the port. I see the messages are pulled by
> Spark but cannot see the data pulled (cannot do any computations).
>
> I tried a different way: the messages are pushed to Apache Kafka and Spark
> pulls messages from the Kafka queue instead of the output port, and this one worked.
>
> My question is why it didn't work for the output port, and is using Kafka instead
> of an output port a good flow?
>
>
>
>

-- 
Sent from Gmail Mobile


Re: output port

2015-10-20 Thread Bryan Bende
Ok, can you describe the flow in NiFi leading up to the output port? What
kind of data is in the content of the FlowFiles?
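For reference, the receiving side typically looks something like the sketch
below (a rough sketch of the spark receiver usage; the URL and port name are
placeholders); the FlowFile content comes back as the byte[] from
NiFiDataPacket.getContent(), so that is where I'd look first:

import org.apache.nifi.remote.client.SiteToSiteClient;
import org.apache.nifi.remote.client.SiteToSiteClientConfig;
import org.apache.nifi.spark.NiFiDataPacket;
import org.apache.nifi.spark.NiFiReceiver;
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class NiFiSparkSketch {
    public static void main(String[] args) throws Exception {
        // site-to-site config pointing at the NiFi output port (placeholders)
        SiteToSiteClientConfig config = new SiteToSiteClient.Builder()
                .url("http://localhost:8080/nifi")
                .portName("Data For Spark")
                .buildConfig();

        JavaStreamingContext ssc = new JavaStreamingContext(
                new SparkConf().setMaster("local[2]").setAppName("nifi-spark-test"),
                Durations.seconds(5));

        JavaReceiverInputDStream<NiFiDataPacket> packets =
                ssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_ONLY()));

        // the FlowFile content is the byte[] from getContent()
        packets.map(packet -> new String(packet.getContent())).print();

        ssc.start();
        ssc.awaitTermination();
    }
}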

On Tuesday, October 20, 2015, Rama Krishna Manne <
chaitanya.mann...@gmail.com> wrote:

> so you are seeing the messages reach the output port and then get removed
> from the queue
>
> yes
>
> And on the spark side the NiFi Spark receiver never receives anything? Or
> it receives message, but they have no content?
>
> It receives the data but no content to do computation
>
> On Mon, Oct 19, 2015 at 2:14 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hello,
>>
>> Just to clarify, so you are seeing the messages reach the output port and
>> then get removed from the queue? And on the spark side the NiFi Spark
>> receiver never receives anything? Or it receives message, but they have no
>> content?
>>
>> -Bryan
>>
>>
>> On Monday, October 19, 2015, Rama Krishna Manne <chaitanya.mann...@gmail.com> wrote:
>>
>>> I have an flow in which messages are emitted to an output and
>>> apache-spark will pull the messages from the port , I see the messages are
>>> pulled by spark but cannot see the data pulled(cannot do any computations)
>>>
>>> I tried a different way , the messages are pushed apache kafka and spark
>>> pulls messages from kafka queue instead of output port and this one worked.
>>>
>>> my question is why it didn't work for output port  and using kafka
>>> instead of output port is it a good flow ?
>>>
>>>
>>>
>>>
>>
>> --
>> Sent from Gmail Mobile
>>
>
>

-- 
Sent from Gmail Mobile


Re: Need help in nifi- flume processor

2015-10-12 Thread Bryan Bende
6.0.jar:1.6.0]
>>> at org.apache.flume.sink.LoggerSink.process(LoggerSink.java:105)
>>> ~[flume-ng-core-1.6.0.jar:1.6.0]
>>> at
>>> org.apache.nifi.processors.flume.ExecuteFlumeSink.onTrigger(ExecuteFlumeSink.java:139)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.processors.flume.AbstractFlumeProcessor.onTrigger(AbstractFlumeProcessor.java:148)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1077)
>>> ~[nifi-framework-core-0.3.0.jar:0.3.0]
>>> at
>>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:127)
>>> [nifi-framework-core-0.3.0.jar:0.3.0]
>>> at
>>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:49)
>>> [nifi-framework-core-0.3.0.jar:0.3.0]
>>> at
>>> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:119)
>>> [nifi-framework-core-0.3.0.jar:0.3.0]
>>> at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> [na:1.7.0_85]
>>> at
>>> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>> [na:1.7.0_85]
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>> [na:1.7.0_85]
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>> [na:1.7.0_85]
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> [na:1.7.0_85]
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> [na:1.7.0_85]
>>> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_85]
>>> 2015-10-10 02:30:46,029 ERROR [Timer-Driven Process Thread-9]
>>> o.a.n.processors.flume.ExecuteFlumeSink
>>> ExecuteFlumeSink[id=2d08dfe7-4fd1-4a10-9d25-0b007a2c41bf]
>>> ExecuteFlumeSink[id=2d08dfe7-4fd1-4a10-9d25-0b007a2c41bf] failed to process
>>> due to org.apache.nifi.processor.exception.FlowFileHandlingException:
>>> StandardFlowFileRecord[uuid=8832b036-51a4-49cf-9703-fc4ed443ab80,claim=StandardContentClaim
>>> [resourceClaim=StandardResourceClaim[id=162207782-7, container=default,
>>> section=7], offset=180436,
>>> length=14078],offset=0,name=8311685679474355,size=14078] is not known in
>>> this session (StandardProcessSession[id=218318]); rolling back session:
>>> org.apache.nifi.processor.exception.FlowFileHandlingException:
>>> StandardFlowFileRecord[uuid=8832b036-51a4-49cf-9703-fc4ed443ab80,claim=StandardContentClaim
>>> [resourceClaim=StandardResourceClaim[id=162207782-7, container=default,
>>> section=7], offset=180436,
>>> length=14078],offset=0,name=8311685679474355,size=14078] is not known in
>>> this session (StandardProcessSession[id=218318])
>>>
>>> Any idea what could be wrong in this.
>>>
>>> Thanks and Regards,
>>> Parul
>>>
>>>
>>> On Fri, Oct 9, 2015 at 6:32 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> Hi Parul,
>>>>
>>>> I think it would be good to keep the convo going on the users list
>>>> since there are more people who can offer help there, and also helps
>>>> everyone learn new solutions.
>>>>
>>>> The quick answer though is that NiFi has an ExecuteProcess processor
>>>> which could execute "tshark -i eth0 -T pdml".
>>>>
>>>> There is not currently an XmlToJson processor, so this could be a place
>>>> where you need a custom processor. For simple cases you can use an
>>>> EvaluateXPath processor to extract values from the XML, and then a
>>>> ReplaceText processor to build a new json document from those extracted
>>>> values.
>>>>
>>>> -Bryan
>>>>
>>>>
>>>> On Fri, Oct 9, 2015 at 3:39 AM, Parul Agrawal <parulagrawa...@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Little more to add.
>>>>>  I need to keep reading the flowfile till END_TAG is received. i.e. we
>>>>> may need to concatenate the flowfile data till END_TAG.
>>>

Re: Need help in nifi- flume processor

2015-10-07 Thread Bryan Bende
Hello,

The NiFi Flume processors are for running Flume sources and sinks within
NiFi. They don't communicate with an external Flume process.

In your example you would need an ExecuteFlumeSource configured to run the
netcat source, connected to a ExecuteFlumeSink configured with the logger.

-Bryan

On Wednesday, October 7, 2015, Parul Agrawal 
wrote:

> Hi,
>
> I was trying to run Nifi Flume processor with the below mentioned
> details but not could bring it up.
>
> I already started flume with the sample configuration file
> =
> # example.conf: A single-node Flume configuration
>
> # Name the components on this agent
> a1.sources = r1
> a1.sinks = k1
> a1.channels = c1
>
> # Describe/configure the source
> a1.sources.r1.type = netcat
> a1.sources.r1.bind = localhost
> a1.sources.r1.port = 4
>
> # Describe the sink
> a1.sinks.k1.type = logger
>
> # Use a channel which buffers events in memory
> a1.channels.c1.type = memory
> a1.channels.c1.capacity = 1000
> a1.channels.c1.transactionCapacity = 100
>
> # Bind the source and sink to the channel
> a1.sources.r1.channels = c1
> a1.sinks.k1.channel = c1
> =
>
> Command used to start flume : $ bin/flume-ng agent --conf conf
> --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console
>
> In the Nifi browser of ExecuteFlumeSink following configuration was done:
> Property   Value
> Sink Type logger
> Agent Name  a1
> Sink Name k1.
>
> Event is sent to the flume using:
> $ telnet localhost 4
> Trying 127.0.0.1...
> Connected to localhost.localdomain (127.0.0.1).
> Escape character is '^]'.
> Hello world! 
> OK
>
> But I could not get any data in the nifi flume processor. Request your
> help in this.
> Do i need to change the example.conf file of flume so that Nifi Flume
> Sink should get the data.
>
> Thanks and Regards,
> Parul
>


-- 
Sent from Gmail Mobile


Re: Need help in nifi- flume processor

2015-10-13 Thread Bryan Bende
Parul,

You can use SplitJson to take a large JSON document and split an array
element into individual documents. I took the JSON you attached and created
a flow like GetFile -> SplitJson -> SplitJson -> PutFile.

In the first SplitJson the path I used was $.packet.proto and in the second
I used $.field. This seemed to mostly work, except some of the splits going
into PutFile still have another level of "field" which needs to be split
again, so you would possibly need some conditional logic to split certain
documents again.
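To make those paths concrete, the shape being split is presumably something
like this (a made-up, heavily trimmed example derived from the paths, not your
actual new.json):

{"packet": {"proto": [
  {"field": [
    {"name": "a", "show": "1"},
    {"field": [ {"name": "b", "show": "2"} ]}
  ]}
]}}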

Alternatively you could write a custom processor that restructures your
JSON.

-Bryan



On Tue, Oct 13, 2015 at 8:36 AM, Parul Agrawal <parulagrawa...@gmail.com>
wrote:

> Hi,
>
> I tried with the above json element. But I am getting the below mentioned
> error:
>
> 2015-10-12 23:53:39,209 ERROR [Timer-Driven Process Thread-9]
> o.a.n.p.standard.ConvertJSONToSQL
> ConvertJSONToSQL[id=0e964781-6914-486f-8bb7-214c6a1cd66e] Failed to parse
> StandardFlowFileRecord[uuid=dfc16db0-c7a6-4e9e-8b4d-8c5b4ec50742,claim=StandardContentClaim
> [resourceClaim=StandardResourceClaim[id=183036971-1, container=default,
> section=1], offset=132621, length=55],offset=0,name=json,size=55] as JSON
> due to org.apache.nifi.processor.exception.ProcessException: IOException
> thrown from ConvertJSONToSQL[id=0e964781-6914-486f-8bb7-214c6a1cd66e]:
> org.codehaus.jackson.JsonParseException: Unexpected character ('I' (code
> 73)): expected a valid value (number, String, array, object, 'true',
> 'false' or 'null')
>
> Also I have a huge JSON object attached (new.json). Can you guide me on
> how to use the ConvertJSONToSQL processor?
> Should I use any other processor before the ConvertJSONToSQL processor
> so that this new.json can be converted into a flat document of key/value
> pairs, or an array of flat documents?
>
> Any help/guidance would be really useful.
>
> Thanks and Regards,
> Parul
>
> On Mon, Oct 12, 2015 at 10:36 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> I think ConvertJSONToSQL expects a flat document of key/value pairs, or
>> an array of flat documents. So I think your JSON would be:
>>
>> [
>> {"firstname":"John", "lastname":"Doe"},
>> {"firstname":"Anna", "lastname":"Smith"}
>> ]
>>
>> The table name will come from the Table Name property.
>>
>> Let us know if this doesn't work.
>>
>> -Bryan
>>
>>
>> On Mon, Oct 12, 2015 at 12:19 PM, Parul Agrawal <parulagrawa...@gmail.com
>> > wrote:
>>
>>> Hi,
>>>
>>> Thank you very much for all the support.
>>> I was able to convert XML format to JSON using a custom Flume source.
>>>
>>> Now I would need the ConvertJSONToSQL processor to insert data into SQL.
>>> I am trying to get hands-on with this processor. Will update you on this.
>>> Meanwhile, if you could share any example of using this processor with
>>> sample
>>> JSON data, that would be great.
>>>
>>> ===
>>>
>>> 1) I tried using ConvertJSONToSQL processor with the below sample json
>>> file:
>>>
>>> "details":[
>>> {"firstname":"John", "lastname":"Doe"},
>>> {"firstname":"Anna", "lastname":"Smith"}
>>> ]
>>>
>>> 2) I created table *details *in the postgreSQL
>>> * select * from details ;*
>>> * firstname | lastname*
>>> *---+--*
>>> *(0 rows)*
>>>
>>> 3) ConvertJSONToSQL Processor property details are as below:
>>> *Property  *   *Value*
>>> JDBC Connection Pool        DBCPConnectionPool
>>> Statement Type              INSERT
>>> Table Name                  details
>>> Catalog Name                No value set
>>> Translate Field Names       false
>>> Unmatched Field Behavior    Ignore Unmatched Fields
>>> Update Keys                 No value set
>>>
>>> But I am getting the below mentioned error in ConvertJSONToSQL Processor.
>>> 2015-10-12 05:15:19,584 ERROR [Timer-Driven Process Thread-1]
>>> o.a.n.p.standard.ConvertJSONToSQL
>>> ConvertJSONToSQL[id=0e964781-6914-486f-8bb7-214c6a1cd66e] Failed to convert
>>> StandardFlowFileRecord[uuid=3a58716b-1474-4d75-91c1-e2fc3b9175ba,claim=StandardContentClaim
>>> [resourceClaim=StandardResourceClaim[id=183036

Re: Solr Cloud support

2015-09-03 Thread Bryan Bende
Srikanth/Joe,

I think I understand the scenario a little better now, and to Joe's points
- it will probably be clearer how to do this in a more generic way as we
work towards the High-Availability NCM.

Thinking out loud here given the current state of things, I'm wondering if
the desired functionality could be achieved by doing something similar to
ListHDFS and FetchHDFS... what if there was a DistributeSolrCommand and
ExecuteSolrCommand?

DistributeSolrCommand would be set to run on the Primary Node and would be
configured with similar properties to what GetSolr has now (zk hosts, a
query, timestamp field, distrib=false, etc), it would query ZooKeeper and
produce a FlowFile for each shard, and the FlowFile would use either the
attributes, or payload, to capture all of the processor property values
plus the shard information, basically producing a command for a downstream
processor to run.

ExecuteSolrCommand would be running on every node and would be responsible
for interpreting the incoming FlowFile and executing whatever operation was
being specified, and then passing on the results.

In a cluster you would likely set this up by having DistributeSolrCommand
send to a Remote Process Group that points back to an input port of itself,
and the input port feeds into ExecuteSolrCommand.
This would get the automatic querying of each shard, but you would still
have DistributeSolrCommand running on one node and needing to be manually
failed over, until we address the HA stuff.

This would be a fair amount of work, but food for thought.
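To make the per-shard querying idea concrete, the kind of request an
ExecuteSolrCommand-type processor would end up running against each shard is
roughly this (a SolrJ sketch with placeholder host/core names and query):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardQuerySketch {
    public static void main(String[] args) throws Exception {
        // talk to one shard replica directly (placeholder host/core name)
        HttpSolrClient shard =
                new HttpSolrClient("http://solr-node1:8983/solr/collection1_shard1_replica1");

        SolrQuery query = new SolrQuery("timestamp:[2015-09-01T00:00:00Z TO *]");
        query.set("distrib", "false");  // only search the core we are connected to

        QueryResponse response = shard.query(query);
        System.out.println("docs from this shard: " + response.getResults().getNumFound());
        shard.close();
    }
}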

-Bryan


On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

> <- general commentary not specific to the solr case ->
>
> This concept of being able to have nodes share information about
> 'which partition' they should be responsible for is a generically
> useful and very powerful thing.  We need to support it.  It isn't
> immediately obvious to me how best to do this as a generic and useful
> thing but a controller service on the NCM could potentially assign
> 'partitions' to the nodes.  Zookeeper could be an important part.  I
> think we need to tackle the HA NCM construct we talked about months
> ago before we can do this one nicely.
>
> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Bryan,
> >
> >  --> "I'm still a little bit unclear about the use case for
> querying
> > the shards individually... is the reason to do this because of a
> > performance/failover concern?"
> >  --> Reason to do this is to achieve better performance with
> the
> > convenience of automatic failover.
> > In the current mode, we do get very good failover offered by Solr.
> Failover
> > is seamless.
> > At the same time, we are not getting best performance. I guess its clear
> to
> > us why having each NiFi process query each shard with distrib=false will
> > give better performance.
> >
> > Now, question is how do we achieve this. Making user configure one NiFi
> > processor for each Solr node is one way to go.
> > I'm afraid this will make failover a tricky process. May even need human
> > intervention.
> >
> > Another approach is to have cluster master in NiFi talk to ZK and decide
> > which shards to query. Divide these shards among slave nodes.
> > My understanding is the NiFi cluster master is not intended for such a purpose.
> > I'm not sure if this even possible.
> >
> > Hope I'm a bit more clear now.
> >
> > Srikanth
> >
> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Srikanth,
> >>
> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
> both
> >> the dev and users list now :)
> >>
> >> I'm still a little bit unclear about the use case for querying the
> shards
> >> individually... is the reason to do this because of a
> performance/failover
> >> concern? or is it something specific about how the data is shared?
> >>
> >> Lets say you have your Solr cluster with 10 shards, each on their own
> node
> >> for simplicity, and then your ZooKeeper cluster.
> >> Then you also have a NiFi cluster with 3 nodes each with their own nifi
> >> instance, the first node designated as the primary, and a fourth node
> as the
> >> cluster manager.
> >>
> >> Now if you want to extract data from your Solr cluster, you would do the
> >> following...
> >> - Drag GetSolr on to the graph
> >> - Set type to "cloud"
> >> - Set the Solr Location to the ZK hosts string
> >> - Set the scheduling to "Primary Node"
> >>
> >> 

Re: MergeContent Question

2015-10-01 Thread Bryan Bende
Hi Chris,

In the MergeContent case, instead of putting \n in the file, try putting a
new line by hitting return.

I remember doing this once: I had created the empty file with vi and
added a new line, and then I actually got 2 new lines in my output because
I guess vi has a new line character in there by default. I think this was
specific to vi but thought I'd mention it.

-Bryan


On Thu, Oct 1, 2015 at 5:12 PM, Christopher Wilson 
wrote:

> I posted a question earlier today regarding pulling JSON out of a string
> from a log file.  I want to aggregate those logs into a file using
> ReplaceText and MergeContent processors.  I'm having an issue where the
> aggregated logs are concatenated into one large string instead of line by
> line (with a newline added to the end).
>
> I've tried adding a '\n' after the replacement value in ReplaceText and
> added the same string to a file in the Demarcator value in MergeContent but
> still getting one REALLY long line in the output file.
>
> Any help on the proper settings appreciated.
>
> -Chris
>


Re: Array into MongoDB

2015-09-24 Thread Bryan Bende
One other thing I thought of... I think the JSON processors read the entire
FlowFile content into memory to do the splitting/evaluating, so I wonder if
you are running into a memory issue with a 180MB JSON file.

Are you running with the default configuration of 512mb set in
conf/bootstrap.conf? If so, it would be interesting to see what happens if
you bump that up.
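For reference, the heap settings live in conf/bootstrap.conf and look something
like the following; bumping the -Xmx value (1g here is just an example) is the
change to try:

java.arg.2=-Xms512m
java.arg.3=-Xmx512m

# e.g. increase to
java.arg.3=-Xmx1g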

On Thu, Sep 24, 2015 at 5:06 PM, Bryan Bende <bbe...@gmail.com> wrote:

> Adam,
>
> Based on that message I suspect that MongoDB does not support sending in
> an array of documents, since it looks like it expects the first character to
> be the start of a document and not an array.
>
> With regards to the SplitJson processor, if you set the JSON Path to $
> then it should split at the top-level and send out each of your two
> documents on the splits relationship.
>
> -Bryan
>
>
> On Thu, Sep 24, 2015 at 4:36 PM, Adam Williams <aaronfwilli...@outlook.com
> > wrote:
>
>> I have an array of JSON objects I am trying to put into Mongo, but I keep
>> hitting this on the PutMongo processor:
>>
>> ERROR [Timer-Driven Process Thread-1]
>> o.a.nifi.processors.mongodb.PutMongo
>> PutMongo[id=c576f8cc-6e21-4881-a7cd-6e3881838a91] Failed to insert
>> StandardFlowFileRecord[uuid=2c670a40-7934-4bc6-b054-1cba23fe7b0f,claim=StandardContentClaim
>> [resourceClaim=StandardResourceClaim[id=1443125646319-1, container=default,
>> section=1], offset=0,
>> length=208380820],offset=0,name=test.json,size=208380820] into MongoDB due
>> to org.bson.BsonInvalidOperationException: readStartDocument can only be
>> called when CurrentBSONType is DOCUMENT, not when CurrentBSONType is
>> ARRAY.: org.bson.BsonInvalidOperationException: readStartDocument can only
>> be called when CurrentBSONType is DOCUMENT, not when CurrentBSONType is
>> ARRAY.
>>
>>
>>
>> I tried to use the SplitJson processor to split the array into segments,
>> but in my experience I can't pull out each JSON object.  The SplitJson
>> processor just hangs and never produces logs or any output at all.  The
>> structure of my data is:
>>
>>
>> [{"id":1, "stat":"something"},{"id":2, "stat":"anothersomething"}]
>>
>>
>> The JSON file itself is pretty large (>100mb).
>>
>>
>> Thank you
>>
>
>


Re: Generate flowfiles from flowfile content

2015-09-23 Thread Bryan Bende
Sorry I missed Joe's email while sending mine... I can put together a
template showing this.
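In the meantime, a rough sketch of the ExtractText approach (the property name
here is just made up): add a dynamic property to ExtractText such as

s3.object.key = (.+)

which should capture the content of each split into attributes (e.g.
s3.object.key / s3.object.key.1), and then point FetchS3Object's Object Key
property at ${s3.object.key}. I haven't tested this exact config, so treat it
as a sketch.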

On Wednesday, September 23, 2015, Bryan Bende <bbe...@gmail.com> wrote:

> David,
>
> Take a look at ExtractText; it is for pulling FlowFile content into
> attributes. I think that will do what you are looking for.
>
> -Bryan
>
>> On Wednesday, September 23, 2015, David Klim <davidkl...@hotmail.com> wrote:
>
>> Hello Bryan,
>>
>> I should have been more specific. What I am trying to do is to fetch
>> files from S3. I am using the GetSQS processor to get new object (files)
>> events, and each event is a json containing the list of new objects (files)
>> in the bucket. The output of the GetSQS is processed by SplitJson and I get
>> flowfiles containing one object key (filename) each. I need to feed this
>> into FetchS3Object to retrive the actual file, but FetchS3Object expects
>> the flowfile filename attribute (or any other) to be the filename. So I
>> guess the problem is moving the filename string from the flowfile content
>> to some attribute.
>>
>> If there is no other alternative, I will implement this processor.
>>
>> Thanks!
>>
>> --
>> From: rbra...@softnas.com
>> To: users@nifi.apache.org
>> Subject: RE: Generate flowfiles from flowfile content
>> Date: Wed, 23 Sep 2015 19:59:21 +
>>
>> Good idea, Adam.
>>
>>
>>
>> I will post a separate review thread on the dev@ list to track comments.
>>
>>
>>
>> Here’s the repository link:  https://github.com/rickbraddy/nifishare
>>
>>
>>
>>
>>
>> Thanks
>>
>> Rick
>>
>>
>>
>> *From:* Adam Taft [mailto:a...@adamtaft.com]
>> *Sent:* Wednesday, September 23, 2015 1:48 PM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Generate flowfiles from flowfile content
>>
>>
>>
>> Not speaking for the entire community, but I am sure that such a
>> contribution would (at minimum) be appreciated for review, consideration
>> and potential inclusion.  The best thing would be ideally hosting the
>> source code somewhere that the rest of the community could go to for
>> review.  Maybe you could host the GetFileData and PutFileData processors on
>> a GitHub repository somewhere?
>>
>> I think the idea you proposed is good, but might need to be aligned with
>> the work (if any) for the referenced ListFile and FetchFile
>> implementation.  And the differences in your PutFileData vs. PutFile would
>> ideally be well vetted as well.
>>
>> Adam
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Sep 23, 2015 at 2:23 PM, Rick Braddy <rbra...@softnas.com> wrote:
>>
>> We have already developed a modified GetFile called GetFileData
>> that takes an incoming FlowFile containing the path to the file/directory
>> that needs to be transferred.  There is a corresponding PutFileData on the
>> other side that accepts the incoming file/directory that creates the
>> directory/tree as needed or writes the file, then sets the permissions and
>> ownership.  GetFileData also receives a file.rootdir attribute that gets
>> passed along to PutFileData, so it can rebase the original file’s location
>> relative to the configured target directory.  Unlike GetFile/PutFile, these
>> processor work with entire directory trees and are triggered by incoming
>> FlowFiles to GetFileData.
>>
>>
>>
>> Eventually, we want to further enhance these two processors so they can
>> break large files into “chunks” and send as multi-part files that get
>> reassembled by PutFileData, resolving the limitations associated with huge
>> files and content repository size; e.g., there are default 100MB chunk
>> threshold and 10MB chunk size properties that will control the chunking, if
>> enabled.
>>
>>
>>
>> If the community is interested and would benefit from these processors, we’re
>> happy to consider further generalizing and contributing these processors,
>> along with any further refinements based upon community review and feedback.
>>
>>
>>
>> I believe these processors would address both the Jira and David’s
>> original inquiry.
>>
>>
>>
>> Rick
>>
>>
>>
>> *From:* Adam Taft [mailto:a...@adamtaft.com]
>> *Sent:* Wednesday, September 23, 2015 1:09 PM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Generate flowfiles from flowfile content

Re: Nifi cluster features - Questions

2016-01-09 Thread Bryan Bende
The sending node (where the remote process group is) will distribute the
data evenly across the two nodes, so an individual file will only be sent
to one of the nodes. You could think of it as if a separate NiFi instance
was sending directly to a two node cluster, it would be evenly distributing
the data across the two nodes. In this case it just so happens to all be
with in the same cluster.

The most common use case for this scenario is the List and Fetch processors,
like ListHDFS/FetchHDFS. You can perform the listing on the primary node, and
then distribute the results so the fetching takes place on all nodes.

On Saturday, January 9, 2016, Chakrader Dewaragatla <
chakrader.dewaraga...@lifelock.com> wrote:

> Bryan – Thanks, how do the nodes distribute the load for an input port? As
> the port is open and listening on two nodes, does it copy the same files to both
> nodes?
> I need to try this setup to see the results, appreciate your help.
>
> Thanks,
> -Chakri
>
> From: Bryan Bende <bbe...@gmail.com>
> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
> Date: Friday, January 8, 2016 at 3:44 PM
> To: "users@nifi.apache.org" <users@nifi.apache.org>
> Subject: Re: Nifi cluster features - Questions
>
> Hi Chakri,
>
> I believe the DistributeLoad processor is more for load balancing when
> sending to downstream systems. For example, if you had two HTTP endpoints,
> you could have the first relationship from DistributeLoad going to a
> PostHTTP that posts to endpoint #1, and the second relationship going to a
> second PostHTTP that goes to endpoint #2.
>
> If you want to distribute the data within the cluster, then you need to
> use site-to-site. The way you do this is the following...
>
> - Add an Input Port connected to your PutFile.
> - Add GenerateFlowFile scheduled on primary node only, connected to a
> Remote Process Group. The Remote Process Group should be connected to the
> Input Port from the previous step.
>
> So both nodes have an input port listening for data, but only the primary
> node produces a FlowFile and sends it to the RPG which then re-distributes
> it back to one of the Input Ports.
>
> In order for this to work you need to set nifi.remote.input.socket.port in
> nifi.properties to some available port, and you probably want
> nifi.remote.input.secure=false for testing.
>
> -Bryan
>
>
> On Fri, Jan 8, 2016 at 6:27 PM, Chakrader Dewaragatla <chakrader.dewaraga...@lifelock.com> wrote:
>
>> Mark – I have setup a two node cluster and tried the following .
>>  GenrateFlowfile processor (Run only on primary node) —> DistributionLoad
>> processor (RoundRobin)   —> PutFile
>>
>> >> The GetFile/PutFile will run on all nodes (unless you schedule it to
>> run on primary node only).
>> From your above comment, It should put file on two nodes. It put files on
>> primary node only. Any thoughts ?
>>
>> Thanks,
>> -Chakri
>>
>> From: Mark Payne <marka...@hotmail.com>
>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>> Date: Wednesday, October 7, 2015 at 11:28 AM
>>
>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>> Subject: Re: Nifi cluster features - Questions
>>
>> Chakri,
>>
>> Correct - when NiFi instances are clustered, they do not transfer data
>> between the nodes. This is very different
>> than you might expect from something like Storm or Spark, as the key
>> goals and design are quite different.
>> We have discussed providing the ability to allow the user to indicate
>> that they want to have the framework
>> do load balancing for specific connections in the background, but it's
>> still in more of a discussion phase.
>>
>> Site-to-Site is simply the capability that we have developed to transfer
>> data between one instance of
>> NiFi and another instance of NiFi. So

Re: Nifi cluster features - Questions

2016-01-08 Thread Bryan Bende
Hi Chakri,

I believe the DistributeLoad processor is more for load balancing when
sending to downstream systems. For example, if you had two HTTP endpoints,
you could have the first relationship from DistributeLoad going to a
PostHTTP that posts to endpoint #1, and the second relationship going to a
second PostHTTP that goes to endpoint #2.

If you want to distribute the data within the cluster, then you need to
use site-to-site. The way you do this is the following...

- Add an Input Port connected to your PutFile.
- Add GenerateFlowFile scheduled on primary node only, connected to a
Remote Process Group. The Remote Process Group should be connected to the
Input Port from the previous step.

So both nodes have an input port listening for data, but only the primary
node produces a FlowFile and sends it to the RPG which then re-distributes
it back to one of the Input Ports.

In order for this to work you need to set nifi.remote.input.socket.port in
nifi.properties to some available port, and you probably want
nifi.remote.input.secure=false for testing.
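For example, in nifi.properties on each node (the port number is just an
example):

nifi.remote.input.socket.port=10000
nifi.remote.input.secure=false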

-Bryan


On Fri, Jan 8, 2016 at 6:27 PM, Chakrader Dewaragatla <
chakrader.dewaraga...@lifelock.com> wrote:

> Mark – I have set up a two node cluster and tried the following.
>  GenerateFlowFile processor (run only on primary node) —> DistributeLoad
> processor (RoundRobin) —> PutFile
>
> >> The GetFile/PutFile will run on all nodes (unless you schedule it to
> run on primary node only).
> From your above comment, it should put files on two nodes. It puts files on
> the primary node only. Any thoughts?
>
> Thanks,
> -Chakri
>
> From: Mark Payne 
> Reply-To: "users@nifi.apache.org" 
> Date: Wednesday, October 7, 2015 at 11:28 AM
>
> To: "users@nifi.apache.org" 
> Subject: Re: Nifi cluster features - Questions
>
> Chakri,
>
> Correct - when NiFi instances are clustered, they do not transfer data
> between the nodes. This is very different
> than you might expect from something like Storm or Spark, as the key goals
> and design are quite different.
> We have discussed providing the ability to allow the user to indicate that
> they want to have the framework
> do load balancing for specific connections in the background, but it's
> still in more of a discussion phase.
>
> Site-to-Site is simply the capability that we have developed to transfer
> data between one instance of
> NiFi and another instance of NiFi. So currently, if we want to do load
> balancing across the cluster, we would
> create a site-to-site connection (by dragging a Remote Process Group onto
> the graph) and give that
> site-to-site connection the URL of our cluster. That way, you can push
> data to your own cluster, effectively
> providing a load balancing capability.
>
> If you were to just run ListenHTTP without setting it to Primary Node,
> then every node in the cluster will be listening
> for incoming HTTP connections. So you could then use a simple load
> balancer in front of NiFi to distribute the load
> across your cluster.
>
> Does this help? If you have any more questions we're happy to help!
>
> Thanks
> -Mark
>
>
> On Oct 7, 2015, at 2:32 PM, Chakrader Dewaragatla <
> chakrader.dewaraga...@lifelock.com> wrote:
>
> Mark - Thanks for the notes.
>
> >> The other option would be to have a ListenHTTP processor run on Primary
> Node only and then use Site-to-Site to distribute the data to other nodes.
> Let's say I have a 5 node cluster and a ListenHTTP processor on the primary node;
> collected data on the primary node is not transferred to other nodes by default
> for processing despite all nodes being part of one cluster?
> If the ListenHTTP processor is running as a default (without an explicit setting
> to run on the primary node), how does the data get transferred to the rest of the
> nodes? Does site-to-site come into play when I make one processor run on
> the primary node?
>
> Thanks,
> -Chakri
>
> From: Mark Payne 
> Reply-To: "users@nifi.apache.org" 
> Date: Wednesday, October 7, 2015 at 7:00 AM
> To: "users@nifi.apache.org" 
> Subject: Re: Nifi cluster features - Questions
>
> Hello Chakro,
>
> When you create a cluster of NiFi instances, each node in the cluster is
> acting independently and in exactly
> the same way. I.e., if you have 5 nodes, all 5 nodes will run exactly the
> same flow. However, they will be
> pulling in different data and therefore operating on different data.
>
> So if you pull in 10 1-gig files from S3, each of those files will be
> processed on the node that pulled the data
> in. NiFi does not currently shuffle data around between nodes in the
> cluster (you can use site-to-site to do
> this if you want to, but it won't happen automatically). If you set the
> number of Concurrent Tasks to 5, then
> you will have up to 5 threads running for that processor on each node.
>
> The only exception to this is the Primary Node. You can schedule a
> Processor 

Re: Use Case Validation

2015-12-23 Thread Bryan Bende
Hi Dan,

This is definitely a use case that NiFi can handle.

A possible architecture for your scenario would be something like the
following...
- Run NiFi instances on the machines where you need to collect logs, these
would not be clustered, just stand-alone instances.
- These would pick up your log files using List/FetchFile, or TailFile, and
send them to a central NiFi using Site-to-Site [1]
- The central NiFi would be receiving the data from all the machines and
making the routing decisions as to which Azure hub to send to.
- Depending on your data volume, the central NiFi could be a cluster of a
few nodes, or for a lower volume it could be a stand-alone instance.

Let us know if you have any questions.

-Bryan

[1] https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site


On Wed, Dec 23, 2015 at 10:52 AM, Dan  wrote:

> I've recently found NiFi and have been playing around with it locally for
> a day or so to assess whether it would be a good fit for the following use
> case:
>
> 1. I'm tasked with gathering log files from 100s of machines from a
> predetermined directory structure local to the machine (e.g. /log/appname/
> or c:\log\appname) which may be Linux or Windows
> 2. File names include date (e.g. appname_20151223.log)
> 3. The log file is structured as JSON - each line of the file is a JSON
> object
> 4. The JSON object in each file includes data that determines where to
> route the message
> 5. Each message should be routed to one of several Azure Event Hubs based
> on #4
>
> Would I set up a single NiFi cluster to do this, or would I set up what
> would essentially be 100 NiFi clusters if I have 100 machines from which I
> want to gather logs from their local /log/appname directory?
>
> Thanks - this looks like a very well thought out project!
>
> Best
> Dan
>


Re: Use Case Validation

2015-12-23 Thread Bryan Bende
Dan,

A stand-alone instance is the default behavior. If you extract a NiFi
distribution and run "bin/nifi.sh start", without changing any of the
clustering related properties, then you get a single instance running on
port 8080 by default.

My thought behind sending them via site-to-site is to have a central
instance/cluster where you can monitor/change the routing part of the flow.
The flow running on the machines where the logs are would likely be a very
simple flow to grab some data and send back, so there wouldn't be as much
to see/change there.

-Bryan

On Wed, Dec 23, 2015 at 11:32 AM, Dan  wrote:

> Thanks; is the idea of sending the log file data via Site-to-Site to
> reduce load caused by making the routing decision on the machine containing
> the logs?
>
> Total newb question: How does one create a stand-alone instance? I wound
> up running 2 processes (node and server) as I started poking around. On the
> "server" process, I filled in the nifi.cluster.is.manager=true along with
> n.c.m.address and n.c.m.protocol.port while on the "node" process, I filled
> in nifi.cluster.is.node=true along with n.c.node.* and pointed the
> n.c.n.unicast.* stuff over to the manager values. Is there a simpler way?
> Can I do this with a single process running?
>
> Thanks
> Dan
>
> --
> Date: Wed, 23 Dec 2015 11:10:23 -0500
> Subject: Re: Use Case Validation
> From: bbe...@gmail.com
> To: users@nifi.apache.org
>
>
> Hi Dan,
>
> This is definitely a use case that NiFi can handle.
>
> A possible architecture for your scenario would be something like the
> following...
> - Run NiFi instances on the machines where you need to collect logs, these
> would not be clustered, just stand-alone instances.
> - These would pick up your log files using List/FetchFile, or TailFile,
> and send them to a central NiFi using Site-to-Site [1]
> - The central NiFi would be receiving the data from all the machines and
> making the routing decisions as to which Azure hub to send to.
> - Depending on your data volume, the central NiFi could be a cluster of a
> few nodes, or for a lower volume it could be a stand-alone instance.
>
> Let us know if you have any questions.
>
> -Bryan
>
> [1]
> https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
>
>
> On Wed, Dec 23, 2015 at 10:52 AM, Dan  wrote:
>
> I've recently found NiFi and have been playing around with it locally for
> a day or so to assess whether it would be a good fit for the following use
> case:
>
> 1. I'm tasked with gathering log files from 100s of machines from a
> predetermined directory structure local to the machine (e.g. /log/appname/
> or c:\log\appname) which may be Linux or Windows
> 2. File names include date (e.g. appname_20151223.log)
> 3. The log file is structured as JSON - each line of the file is a JSON
> object
> 4. The JSON object in each file includes data that determines where to
> route the message
> 5. Each message should be routed to one of several Azure Event Hubs based
> on #4
>
> Would I set up a single NiFi cluster to do this, or would I set up what
> would essentially be 100 NiFi clusters if I have 100 machines from which I
> want to gather logs from their local /log/appname directory?
>
> Thanks - this looks like a very well thought out project!
>
> Best
> Dan
>
>
>


Re: How to save the original JSON to Solr - PutSolr

2015-12-23 Thread Bryan Bende
Bob,

The field mappings with the JSON paths are actually something provided by
Solr. The PutSolr processor is just passing all those parameters over, and
Solr is doing the extraction, so we are limited here by what Solr provides.
I'm not aware of a way to tell Solr to select the whole document into one
field, but it may be possible.

An alternative might be to modify the JSON within NiFi to add a new field
that contains the whole original document. The ExtractText processor would
likely be your starting point for selecting the whole content into an
attribute, and then ReplaceText could modify the original JSON.

I believe there is a ticket somewhere in JIRA for a ModifyJson processor
which would make this easier in the future, but it doesn't exist right now.

-Bryan

On Wednesday, December 23, 2015, Bob Zhao  wrote:

> Hello,
>
> Currently, PutSolr can parse the JSON with path very easily.
> If I want to save those extracted items with the original raw JSON
> (another column) together,
> how the expression should be?
>
> f.20  json_raw_t:\
>
> doesn't work.
>
> Thanks,
> Bob
>
>
>

-- 
Sent from Gmail Mobile


Re: queued files

2015-11-19 Thread Bryan Bende
Charlie,

The behavior you described usually means that the processor encountered an
unexpected error that was thrown back to the framework, which rolls back
the processing of that flow file and leaves it in the queue, as opposed to
an expected error, which would usually be routed to a failure relationship.

Is the id that you see in the bulletin a uuid?

There should still be some provenance events for this FlowFile from the
previous points in the flow. If it looks like the uuid of the FlowFile,
that should be searchable from provenance using the search button on the
right. Let us know if we can help more.

-Bryan

On Thu, Nov 19, 2015 at 9:10 PM, Charlie Frasure 
wrote:

> I have a question on troubleshooting a flow.  I've built a flow with no
> exception routing, just trying to process the expected values first.  When
> a file exposes a problem with the logic in my flow, it queues up prior to
> the flow that is raising the bulletin.
>
> In the bulletin, I can see an id, but can't tell which file it is.  Data
> provenance doesn't seem to help as it passed the flow on the last
> processor, but hasn't been logged (to my knowledge) on the next one.
>
> Is there a way to match the bulletin back to a file without creating a
> route for failed files?
>


Re: Data Ingestion forLarge Source Files and Masking

2016-01-12 Thread Bryan Bende
Obaid,

I can't say for sure how much this would improve performance, but you might
want to wrap the OutputStream with a BufferedOutputStream or BufferedWriter.
I would be curious to hear if that helps.

A similar scenario from the standard processors is ReplaceText, here is one
example where it uses the StreamCallback:
https://github.com/apache/nifi/blob/ee14d8f9dd0c3f18920d910fcddd6d79b8b9f9cf/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ReplaceText.java#L337
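
For example, here is a rough, untested sketch of your callback with
buffering added (it assumes your existing parseLine(...) method and the
seperator/quote/escape/maskColumns fields, and leaves out the header
handling for brevity):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

flowfile = session.write(flowfile, new StreamCallback() {
    @Override
    public void process(InputStream in, OutputStream out) throws IOException {
        // Buffer both sides so each masked line is not written to the
        // content repository as its own tiny write.
        final BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out));

        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().length() > 0) {
                writer.write(parseLine(line, seperator, quote, escape, maskColumns));
                writer.write("\n");
            }
        }
        writer.flush();
    }
});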

-Bryan

On Tue, Jan 12, 2016 at 8:38 PM, obaidul karim  wrote:

> Hi Joe,
>
> Yes, I took consideration of existinh RAID and HW settings. We have 10G
> NIC for all hadoop intra-connectivity and the server in question is an edge
> node of our hadoop cluster.
> In production scenario we will use dedicated ETL servers having high
> performance(>500MB/s) local disks.
>
> Sharing a good news, I have successfully mask & load to HDFS 110 GB data
> using below flow:
>
> ExecuteProcess(touch and mv to input dir) > ListFile (1 thread) >
> FetchFile (1 thread) > maskColumn(4 threads) > PutHDFS (1 threads).
>
> * used 4 threads for masking and 1 for other because I found it is the
> slowest component.
>
> However, It seems to be too slow. It was processing 2GB files in  6
> minutes. It may be because of my masking algorithm(although masking
> algorithm is pretty simple FPE with some simple twist).
> However I want to be sure that the way I have written custom processor is
> the most efficient way. Please below code chunk and let me know whether it
> is the fastest way to process flowfiles (csv source files) which needs
> modifications on specific columns:
>
> * parseLine method contains logic for masking.
>
>flowfile = session.write(flowfile, new StreamCallback() {
> @Override
>public void process(InputStream in, OutputStream out) throws
> IOException {
>
> BufferedReader reader = new BufferedReader(new
> InputStreamReader(in));
> String line;
> if(skipHeader == true && headerExists==true) { // to skip header,
> do an additional line fetch before going to next step
> if(reader.ready())   reader.readLine();
> } else if( skipHeader == false && headerExists == true) { // if
> header is not skipped then no need to mask, just pass through
> if(reader.ready())
> out.write((reader.readLine()+"\n").getBytes());
> }
>
> // decide about empty line earlier
> while ((line = reader.readLine()) != null) {
> if(line.trim().length() > 0 ) {
> out.write( parseLine(line, seperator, quote, escape,
> maskColumns).getBytes() );
> }
> };
> out.flush();
>}
>});
>
>
>
>
> Thanks in advance.
> -Obaid
>
>
> On Tue, Jan 5, 2016 at 12:36 PM, Joe Witt  wrote:
>
>> Obaid,
>>
>> Really happy you're seeing the performance you need.  That works out
>> to about 110MB/s on average over that period.  Any chance you have a
>> 1GB NIC?  If you really want to have fun with performance tuning you
>> can use things like iostat and other commands to observe disk,
>> network, cpu.  Something else to consider too is the potential
>> throughput gains of multiple RAID-1 containers rather than RAID-5
>> since NiFi can use both in parallel.  Depends on your goals/workload
>> so just an FYI.
>>
>> A good reference for how to build a processor which does altering of
>> the data (transformation) is here [1].  It is a good idea to do a
>> quick read through that document.  Also, one of the great things you
>> can do as well is look at existing processors.  Some good examples
>> relevant to transformation are [2], [3], and [4] which are quite
>> simple stream transform types. Or take a look at [5] which is a more
>> complicated example.  You might also be excited to know that there is
>> some really cool work done to bring various languages into NiFi which
>> looks on track to be available in the upcoming 0.5.0 release which is
>> NIFI-210 [6].  That will provide a really great option to quickly
>> build transforms using languages like Groovy, JRuby, Javascript,
>> Scala, Lua, Javascript, and Jython.
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#enrich-modify-content
>>
>> [2]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/Base64EncodeContent.java
>>
>> [3]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/TransformXml.java
>>
>> [4]
>> https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/ModifyBytes.java
>>
>> [5]
>> 

Re: "Processor requires an upstream connection" for FetchS3Object?

2016-01-12 Thread Bryan Bende
Russell/Corey,

In 0.4.0 there is a new way for processors to indicate what they expect as
far as input: it can be required, allowed, or forbidden. This prevents
scenarios like ExecuteSQL, which at one point required an input FlowFile
but could still be started without an incoming connection, and it wasn't
clear why it wasn't doing anything. So now if a processor says input is
required, it will be considered invalid if there are no incoming
connections.

In the case of FetchS3, there is definitely intent to have a ListS3, but
there are still ways to use it without that...

One scenario might be to retrieve the same bucket on a timer, maybe once a
day or every hour... this could be done with a GenerateFlowFile processor
scheduled to the appropriate interval and feeding into FetchS3, with
FetchS3 containing a hard-coded bucket id. GenerateFlowFile would be acting
as the trigger here.

A second scenario might be to receive messages from somewhere else (JMS,
Kafka, HTTP) which contain bucket ids, and feed these FlowFiles into
FetchS3. The bucket id on FetchS3 could use expression language to
reference a bucket id on the incoming FlowFile.

Hope this helps.

-Bryan

On Tue, Jan 12, 2016 at 9:39 PM, Corey Flowers 
wrote:

> Hello Russell,
>
>Sorry if that seemed short, I was running in to pick my son up
> from  my practice. What I meant to say was that you are correct. Although I
> haven't worked on those processors, I do believe it is expecting the listS3
> processor to function and that is why you are getting that error. I would
> love to know if there is a work around because I would also love to work
> with these processors within the next week or so.
>
> On Tue, Jan 12, 2016 at 9:16 PM, Russell Whitaker <
> russell.whita...@gmail.com> wrote:
>
>> On Tue, Jan 12, 2016 at 6:11 PM, Corey Flowers 
>> wrote:
>> > Ha ha! Well that would do it! :)
>> >
>>
>> I don't know what that would "do" other than confirm that the
>> FetchS3Object processor shipped
>> with v0.4.1 needs its doc to reflect the fact it's not yet useable
>> untl a ListS3* processor is implemented
>> and included in the narfile for the distribution.
>>
>> Russell
>>
>> > Sent from my iPhone
>> >
>> >> On Jan 12, 2016, at 9:10 PM, Russell Whitaker <
>> russell.whita...@gmail.com> wrote:
>> >>
>> >>> On Tue, Jan 12, 2016 at 6:02 PM, Corey Flowers <
>> cflow...@onyxpoint.com> wrote:
>> >>> I haven't worked with this processor but I believe it is looking for
>> >>> the S3 list processor to generate the list of objects to fetch. Did
>> >>> you try that yet?
>> >>
>> >> I mentioned this: "There's no "ListS3Object" processor type which
>> >> might hypothetically populate
>> >> attributes for FetchS3Object to act upon." I should have made this
>> >> doubly explicit that I checked
>> >> in the processor creation dialogue.
>> >>
>> >> Also, this:
>> >>
>> https://mail-archives.apache.org/mod_mbox/nifi-users/201510.mbox/%3cd23c06e8.ca0%25chakrader.dewaraga...@lifelock.com%3E
>> >>
>> >> "There is already a ticket
>> >> (NIFI-840)
>> >> in the hopper to create a ListS3Objects processor that can track
>> >> bucket contents and trigger
>> >> FetchS3Object."
>> >>
>> >> Oh god, it does appear that v0.4.1 ships with an implemented
>> >> FetchS3Object processor but no
>> >> List processor to feed it:
>> >>
>> >> https://issues.apache.org/jira/browse/NIFI-840
>> >>
>> >> Status: unresolved
>> >>
>> >> Description: "A processor is needed that can provide an S3 listing to
>> >> use in conjunction with FetchS3Object. This is to provide a similar
>> >> user experience as with the HDFS processors that perform List/Get."
>> >>
>> >> I think this means I'm horked. And the Relationships section of the
>> >> FetchS3Object doc is still wrong.
>> >>
>> >> Russell
>> >>
>> >>
>> >>> Sent from my iPhone
>> >>>
>>  On Jan 12, 2016, at 8:38 PM, Russell Whitaker <
>> russell.whita...@gmail.com> wrote:
>> 
>>  I'm running v0.4.1 Nifi, and seeing this (taken from nifi-app.log,
>>  also seeing on mouseover of the "!" icon on the processor on the
>>  canvas):
>> 
>>  2016-01-12 17:08:50,357 ERROR [NiFi Web Server-18]
>>  o.a.nifi.groups.StandardProcessGroup Unable to start
>>  FetchS3Object[id=f4253204-a2e2-4ce6-ba09-9415e8024dca] due to {}
>>  java.lang.IllegalStateException: Processor FetchS3Object is not in a
>>  valid state due to ['Upstream Connections' is invalid because
>>  Processor requires an upstream connection but currently has none]
>> 
>>  Per:
>> 
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.aws.s3.FetchS3Object/index.html
>> ,
>>  FetchS3Object "Retrieves the contents of an S3 Object and writes it
>> to
>>  the content of a FlowFile," which would seem to indicate this is an
>>  "edge" processor that doesn't expect a flowfile from an 

Re: How does GetSFTP treat files being loaded?

2016-06-03 Thread Bryan Bende
Adding to what Oleg said...

GetSFTP has a property called Ignore Dotted Files, which defaults to true
and tells GetSFTP to ignore filenames that begin with a dot. This can help
in this scenario if the uploader can upload to a temporary file starting
with a dot and rename it when the upload is done.

-Bryan


On Fri, Jun 3, 2016 at 11:10 AM, Oleg Zhurakousky <
ozhurakou...@hortonworks.com> wrote:

> Huagen
>
> Just to clarify. There isn’t really an SFTP server. SFTP is just an FTP
> layer over SSH - allowing you to get FTP-like experience when accessing
> file system over SSH.
> Now, to your question.
> I am assuming you are implying that some other system is uploading a large
> file. In any event a typical and well established pattern for
> upload/download is to use temporary name until upload/download has finished
> and then rename. This way the consuming system doesn’t see
> “work-in-progress” until it is finished. In other words  the consuming
> system will never see that file until its fully uploaded.
>
> Hope this helps.
> Cheers
> Oleg
>
> > On Jun 3, 2016, at 10:59 AM, Huagen peng  wrote:
> >
> > Hi,
> >
> > I need to get files from a SFTP server and then remove the files
> afterward.  GetSFTP seems to be the processor to use.  If a user uploads a
> large file, say 20G, to the server and the GetSFTP processor happens to be
> running in the middle of the uploading, what is the expected behavior.
> Does the processor pick up the file? If so, can the processor get the
> entire file and then safely delete it? Does anybody has experience on that?
> >
> > Huagen
>
>


Re: Custom processor is failing for concurrency

2016-06-03 Thread Bryan Bende
It is hard to say for sure, but I think your NiFi processor is generally OK
regarding thread safety; there could, however, be a problem in the Azure
SDK code...

RequestFactory has an instance of BaseUrl and every time
RequestFactory.create() is called, it calls BaseUrl.url().

The implementation of BaseUrl is the following (according to the sources my
IntelliJ attached):

public class AutoRestBaseUrl implements BaseUrl {
    /** A template based URL with variables wrapped in {}s. */
    private String template;
    /** a mapping from {} wrapped variables in the template and their actual values. */
    private Map<CharSequence, String> mappings;

    @Override
    public HttpUrl url() {
        String url = template;
        for (Map.Entry<CharSequence, String> entry : mappings.entrySet()) {
            url = url.replace(entry.getKey(), entry.getValue());
        }
        mappings.clear();
        return HttpUrl.parse(url);
    }

    /**
     * Creates an instance of a template based URL.
     *
     * @param url the template based URL to use.
     */
    public AutoRestBaseUrl(String url) {
        this.template = url;
        this.mappings = new HashMap<>();
    }

    /**
     * Sets the value for the {} wrapped variables in the template URL.
     * @param matcher the {} wrapped variable to replace.
     * @param value the value to set for the variable.
     */
    public void set(CharSequence matcher, String value) {
        this.mappings.put(matcher, value);
    }
}

The exception is coming from the line where it is looping over the entry set:

for (Map.Entry<CharSequence, String> entry : mappings.entrySet()) {

Right after that loop it calls mappings.clear(), so if the RequestFactory is
shared by multiple threads (which I think it is), then one thread could be
iterating over the set while another calls mappings.clear().
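
If that is the case, one blunt interim workaround (just a sketch; the lock
field name here is made up) would be to serialize your calls to the shared
client until the SDK is fixed, at the cost of giving up concurrency for
that call:

private final Object azureClientLock = new Object();

// ... inside onTrigger ...
synchronized (azureClientLock) {
    // your existing create-file call on the shared client goes here, unchanged
}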


On Fri, Jun 3, 2016 at 5:09 PM, Oleg Zhurakousky <
ozhurakou...@hortonworks.com> wrote:

> Kumiko
>
> It appears that the current state of the source you linked in is not in
> sync with what is in the stack trace. Perhaps you have made some code
> modifications (e.g., line 218 is an empty line in code while it has a
> pointer in the star trace).
> In any event, from what I can see the error is coming from Azure libraries
> (not NiFi). Specifically ‘com.microsoft.rest.AutoRestBaseUrl.url(..)’
> seems to be doing some iteration where I presume the remove is called.
> Perhaps it is not a thread safe class after all. What does Microsoft
> documentation says? Have you looked at the source to see what’s going on
> there? If its open please link and we can tale a look.
>
> Cheers
> Oleg
>
> On Jun 3, 2016, at 4:58 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:
>
> Here is the code, https://github.com/kyada1/PutFileAzureDLStore.
>
> Thanks
> Kumiko
>
> *From:* Bryan Bende [mailto:bbe...@gmail.com <bbe...@gmail.com>]
> *Sent:* Friday, June 3, 2016 12:57 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Custom processor is failing for the custom processor
>
> Hello,
>
> Would you be able to share your code for PutFileAzureDLStore so we can
> help identify if there is a concurrency problem?
>
> -Bryan
>
> On Fri, Jun 3, 2016 at 3:39 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:
>
> Hello,
>
> I wrote the following custom service control and processor.  When the
> custom processor is running concurrently, it’s failing often with several
> different errors.  Are there any special handlings for concurrently that I
> need to add in the custom processor?  I wrote the sample Java program which
> does the same thing as the custom processor (authenticate every time the
> file is created/create a file, create 2 threads and run concurrently), it’s
> working fine.  The custom processor also fine when this is not running
> concurrently.
>
> *Custom service control – set the properties for the Microsoft Azure
> Datalake Store*
> *Custom processor – authenticate, then create a file in Microsoft Azure
> Datalake Store*
>
> Error1:
> 2016-06-03 12:29:31,942 INFO [pool-2815-thread-1]
> c.m.aad.adal4j.AuthenticationAuthority [Correlation ID:
> 64c89876-2f7b-4fbb-8e6f-fb395a47aa87] Instance discovery was successful
> 2016-06-03 12:29:31,946 ERROR [Timer-Driven Process Thread-10]
> n.a.d.processors.PutFileAzureDLStore
> java.util.ConcurrentModificationException: null
> at
> java.util.HashMap$HashIterator.nextNode(HashMap.java:1429) ~[na:1.8.0_77]
> at java.util.HashMap$EntryIterator.next(HashMap.java:1463)
> ~[na:1.8.0_77]
> at java.util.HashMap$EntryIterator.next(HashMap.java:1461)
> ~[na:1.8.0_77]
> at
> com.microsoft.rest.AutoRestBaseUrl.url(AutoRestBaseUrl.java:28) ~[na:na]
> at retrofit2.RequestFactory.create(RequestFactory.java:50)
> ~[na:na]
> at retrofit2.OkHttpCall.createRawCall

Re: Kerberos / NiFi 0.5.1 / ListHDFS

2016-06-08 Thread Bryan Bende
Hi Michael,

Can you confirm that the ListHDFS that is showing the problem is pointing
at a core-site.xml that has the value of "hadoop.security.authentication"
set to "kerberos" ?
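
For reference, a kerberized core-site.xml would typically contain a
property like the following (shown here just as a sanity check):

<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>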

-Bryan


On Wed, Jun 8, 2016 at 12:22 PM, Michael Dyer 
wrote:

> I'm looking for any additional steps that I could take to troubleshoot the
> problem below.
>
> I have two ListHDFS processors, both using Kerberos, each pointing to a
> different HDFS system.
>
> My first process works perfectly, but my second processor is throwing the
> following error:
>
> 15:53:41 UTC ERROR aef493a8-5790-4b21-8155-16d2152ab334
> ListHDFS[id=aef493a8-5790-4b21-8155-16d2152ab334] Failed to perform
> listing of HDFS due to org.apache.hadoop.security.AccessControlException:
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]:
> org.apache.hadoop.security.AccessControlException: SIMPLE authentication is
> not enabled.  Available:[TOKEN, KERBEROS]
>
> The stack crawl in the log file doesn't provide much more info:
>
> 5:53:41 UTC ERROR aef493a8-5790-4b21-8155-16d2152ab334
> ListHDFS[id=aef493a8-5790-4b21-8155-16d2152ab334] Failed to perform
> listing of HDFS due to org.apache.hadoop.security.AccessControlException:
> SIMPLE authentication is not enabled.  Available:[TOKEN, KERBEROS]:
> org.apache.hadoop.security.AccessControlException: SIMPLE authentication is
> not enabled.  Available:[TOKEN, KERBEROS]
>
> 2016-06-08 15:53:41,599 ERROR [Timer-Driven Process Thread-2]
> o.apache.nifi.processors.hadoop.ListHDFS
> ListHDFS[id=aef493a8-5790-4b21-8155-16d2152ab334] Failed to perform listing
> of HDFS due to org.apache.hadoop.security.AccessControlException: SIMPLE
> authentication is not enabled.  Available:[TOKEN, KERBEROS]:
> org.apache.hadoop.security.AccessControlException: SIMPLE authentication is
> not enabled.  Available:[TOKEN, KERBEROS]
> 2016-06-08 15:53:41,600 INFO [NiFi Web Server-6415]
> o.a.n.c.s.TimerDrivenSchedulingAgent Stopped scheduling
> ListHDFS[id=aef493a8-5790-4b21-8155-16d2152ab334] to run
> 2016-06-08 15:53:41,604 ERROR [Timer-Driven Process Thread-2]
> o.apache.nifi.processors.hadoop.ListHDFS
> org.apache.hadoop.security.AccessControlException: SIMPLE authentication
> is not enabled.  Available:[TOKEN, KERBEROS]
> at sun.reflect.GeneratedConstructorAccessor328.newInstance(Unknown
> Source) ~[na:na]
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> ~[na:1.8.0_74]
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> ~[na:1.8.0_74]
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
> ~[hadoop-common-2.6.2.jar:na]
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
> ~[hadoop-common-2.6.2.jar:na]
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1960)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:1941)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:693)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> ~[hadoop-common-2.6.2.jar:na]
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
> ~[hadoop-hdfs-2.6.2.jar:na]
> at
> org.apache.nifi.processors.hadoop.ListHDFS.getStatuses(ListHDFS.java:342)
> ~[nifi-hdfs-processors-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.processors.hadoop.ListHDFS.onTrigger(ListHDFS.java:270)
> ~[nifi-hdfs-processors-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> [nifi-api-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1139)
> [nifi-framework-core-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:139)
> [nifi-framework-core-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:49)
> [nifi-framework-core-0.5.1.jar:0.5.1]
> at
> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:124)
> [nifi-framework-core-0.5.1.jar:0.5.1]
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_74]
>

Re: Custom processor is failing for concurrency

2016-06-09 Thread Bryan Bende
Kumiko,

In general you shouldn't have to create threads in your processors, with
the exception of some special cases.
The framework has a thread pool and it takes one of those threads and calls
the onTrigger method of your processor.

If you want multiple threads to call onTrigger, then each processor has a
Concurrent Tasks property in the UI on the scheduling tab,
which equates to the number of threads that will concurrently call
onTrigger.

A processor developer only needs to worry about the business logic in the
onTrigger method, and needs to ensure thread-safe access to any member
variables or state stored in the processor.

Hope that helps.

-Bryan


On Thu, Jun 9, 2016 at 2:11 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:

> Microsoft found this is an issue with the SDK, they are working on a fix,
> they do not have the ETA for the fix.  To workaround this issue, I’m trying
> to create the multiple threads in using AbstractSessionFactoryProcessor and
> handle the Create a file in a single thread.   I’m having a problem that
> the single thread is not working correctly.  The processor is still acting
> like a single thread.
>
>
>
> When I create a thread to handle the create a file, do I have to call this
> method using java.util.concurrent.ExecutorService?
>
>
>
> Are there any sample processors that I can take a look?
>
>
>
> Thanks
>
> Kumiko
>
>
>
> *From:* Kumiko Yada [mailto:kumiko.y...@ds-iq.com]
> *Sent:* Sunday, June 5, 2016 6:28 PM
> *To:* users@nifi.apache.org
> *Cc:* Ki Kang <ki.k...@ds-iq.com>; Kevin Verhoeven <
> kevin.verhoe...@ds-iq.com>
> *Subject:* RE: Custom processor is failing for concurrency
>
>
>
> Thank you, Bryan.  I’m working with Microsoft on this issue.  Will keep
> you guys updated.
>
>
>
> Thanks
>
> Kumiko
>
>
>
> *From:* Bryan Bende [mailto:bbe...@gmail.com <bbe...@gmail.com>]
> *Sent:* Friday, June 3, 2016 2:32 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Custom processor is failing for concurrency
>
>
>
> It is hard to say for sure, but I think your NiFi processor is generally
> ok regarding thread safety, but I think there could be a problem in the
> Azure SDK code...
>
>
>
> RequestFactory has an instance of BaseUrl and every time
> RequestFactory.create() is called, it calls BaseUrl.url().
>
>
>
> The implementation of BaseUrl is the following (according to my IntelliJ
> attaching the sources...):
>
>
>
> public class AutoRestBaseUrl implements BaseUrl {
> /** A template based URL with variables wrapped in {}s. */
> private String template;
> /** a mapping from {} wrapped variables in the template and their actual
> values. */
> private Map<CharSequence, String> mappings;
>
> @Override
> public HttpUrl url() {
> String url = template;
> for (Map.Entry<CharSequence, String> entry : mappings.entrySet()) {
> url = url.replace(entry.getKey(), entry.getValue());
> }
> mappings.clear();
> return HttpUrl.parse(url);
> }
>
> /**
> * Creates an instance of a template based URL.
> *
> * @param url the template based URL to use.
> */
> public AutoRestBaseUrl(String url) {
> this.template = url;
> this.mappings = new HashMap<>();
> }
>
> /**
> * Sets the value for the {} wrapped variables in the template URL.
> * @param matcher the {} wrapped variable to replace.
> * @param value the value to set for the variable.
> */
> public void set(CharSequence matcher, String value) {
> this.mappings.put(matcher, value);
> }
> }
>
>
>
> The exception is coming from the line where it is looping over the
> entryset:
>
>
>
> for (Map.Entry<CharSequence, String> entry : mappings.entrySet()) {
>
>
>
> Right after that loop it calls mappings.clear() so if the RequestFactory
> is shared by multiple threads (which I think it is), then one thread could
> be iterating over the set, which another calls mappings.clear().
>
>
>
>
>
> On Fri, Jun 3, 2016 at 5:09 PM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
> Kumiko
>
>
>
> It appears that the current state of the source you linked in is not in
> sync with what is in the stack trace. Perhaps you have made some code
> modifications (e.g., line 218 is an empty line in code while it has a
> pointer in the star trace).
>
> In any event, from what I can see the error is coming from Azure libraries
> (not NiFi). Specifically ‘com.microsoft.rest.AutoRestBaseUrl.url(..)’ seems
> to be doing some iteration where I presume the remove is called. Perhaps it
> is not a thread safe class after all. What does Microsoft documentation
> says? Have you looked at the so

Re: Custom processor is failing for concurrency

2016-06-09 Thread Bryan Bende
Yes, I think it ends up being the same thing.

If you create multiple threads that all use the
same DataLakeStoreFileSystemManagementClient,
or if you set Concurrent Tasks > 1 in the UI, both will potentially
run into the problem in the Microsoft SDK.

On Thu, Jun 9, 2016 at 2:49 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:

> Hi Bryan,
>
> Does this mean that even I create the multiple threads in onTriger, I will
> still hit the Microsoft SDK issue where it's not a thread safe?  Sounds
> like basically what I am trying to do and creating the  multiple threads
> via UI might be the same thing.
>
> Thanks
> Kumiko
> ------
> *From:* Bryan Bende <bbe...@gmail.com>
> *Sent:* Thursday, June 9, 2016 11:26:10 AM
> *To:* users@nifi.apache.org
> *Cc:* Kevin Verhoeven; Ki Kang
>
> *Subject:* Re: Custom processor is failing for concurrency
>
> Kumiko,
>
> In general you shouldn't have to create threads in your processors, with
> the exception of some special cases.
> The framework has a thread pool and it takes one of those threads and
> calls the onTrigger method of your processor.
>
> If you want multiple threads to call onTrigger, then each processor has a
> Concurrent Tasks property in the UI on the scheduling tab,
> which equates to the number of threads that will concurrently call
> onTrigger.
>
> A processor developer needs to only worry about the business logic in the
> onTrigger method, and needs to ensure
> thread-safe access to any member variables or state stored in the
> processor.
>
> Hope that helps.
>
> -Bryan
>
>
> On Thu, Jun 9, 2016 at 2:11 PM, Kumiko Yada <kumiko.y...@ds-iq.com> wrote:
>
>> Microsoft found this is an issue with the SDK, they are working on a fix,
>> they do not have the ETA for the fix.  To workaround this issue, I’m trying
>> to create the multiple threads in using AbstractSessionFactoryProcessor and
>> handle the Create a file in a single thread.   I’m having a problem that
>> the single thread is not working correctly.  The processor is still acting
>> like a single thread.
>>
>>
>>
>> When I create a thread to handle the create a file, do I have to call
>> this method using java.util.concurrent.ExecutorService?
>>
>>
>>
>> Are there any sample processors that I can take a look?
>>
>>
>>
>> Thanks
>>
>> Kumiko
>>
>>
>>
>> *From:* Kumiko Yada [mailto:kumiko.y...@ds-iq.com]
>> *Sent:* Sunday, June 5, 2016 6:28 PM
>> *To:* users@nifi.apache.org
>> *Cc:* Ki Kang <ki.k...@ds-iq.com>; Kevin Verhoeven <
>> kevin.verhoe...@ds-iq.com>
>> *Subject:* RE: Custom processor is failing for concurrency
>>
>>
>>
>> Thank you, Bryan.  I’m working with Microsoft on this issue.  Will keep
>> you guys updated.
>>
>>
>>
>> Thanks
>>
>> Kumiko
>>
>>
>>
>> *From:* Bryan Bende [mailto:bbe...@gmail.com <bbe...@gmail.com>]
>> *Sent:* Friday, June 3, 2016 2:32 PM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Custom processor is failing for concurrency
>>
>>
>>
>> It is hard to say for sure, but I think your NiFi processor is generally
>> ok regarding thread safety, but I think there could be a problem in the
>> Azure SDK code...
>>
>>
>>
>> RequestFactory has an instance of BaseUrl and every time
>> RequestFactory.create() is called, it calls BaseUrl.url().
>>
>>
>>
>> The implementation of BaseUrl is the following (according to my IntelliJ
>> attaching the sources...):
>>
>>
>>
>> public class AutoRestBaseUrl implements BaseUrl {
>> /** A template based URL with variables wrapped in {}s. */
>> private String template;
>> /** a mapping from {} wrapped variables in the template and their actual
>> values. */
>> private Map<CharSequence, String> mappings;
>>
>> @Override
>> public HttpUrl url() {
>> String url = template;
>> for (Map.Entry<CharSequence, String> entry : mappings.entrySet()) {
>> url = url.replace(entry.getKey(), entry.getValue());
>> }
>> mappings.clear();
>> return HttpUrl.parse(url);
>> }
>>
>> /**
>> * Creates an instance of a template based URL.
>> *
>> * @param url the template based URL to use.
>> */
>> public AutoRestBaseUrl(String url) {
>> this.template = url;
>> this.mappings = new HashMap<>();
>> }
>>
>> /**
>> * Sets the value for the {} wrapped variables in the template URL.
>> * 

Re: Dependency for SSL

2016-05-25 Thread Bryan Bende
Hello,

This Wiki page shows how to set up the dependencies to use the
SSLContextService from a custom processor bundle:

https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions#MavenProjectsforExtensions-LinkingProcessorsandControllerServices
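
In short, the pattern described there is to keep the
nifi-ssl-context-service-api dependency in your processors module (with
provided scope) and to add a NAR-type dependency in your -nar module so the
service implementation is available at runtime. A sketch, assuming your
bundle follows the standard archetype layout:

<!-- in your bundle's -nar module pom.xml -->
<dependency>
    <groupId>org.apache.nifi</groupId>
    <artifactId>nifi-standard-services-api-nar</artifactId>
    <type>nar</type>
</dependency>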

Thanks,

Bryan


On Wed, May 25, 2016 at 3:44 PM, Kumiko Yada  wrote:

> Hello,
>
>
>
> I’d like to use the following in the custom processor in Nifi 0.6.1, and I
> added the dependency.  However, I’m getting the following errors.  Do I
> need to add something else to correctly handle this dependency?
>
>
>
> import org.apache.nifi.ssl.SSLContextService;
>
> import org.apache.nifi.ssl.SSLContextService.ClientAuth;
>
>
>
> 
>
> org.apache.nifi
>
> nifi-ssl-context-service-api
>
> 
>
>
>
> Error:
>
> at java.util.ServiceLoader.fail(ServiceLoader.java:232)
> ~[na:1.8.0_77]
>
> at
> java.util.ServiceLoader.access$100(ServiceLoader.java:185) ~[na:1.8.0_77]
>
> at
> java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
> ~[na:1.8.0_77]
>
> at
> java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
> ~[na:1.8.0_77]
>
> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
> ~[na:1.8.0_77]
>
> at
> org.apache.nifi.nar.ExtensionManager.loadExtensions(ExtensionManager.java:107)
> ~[nifi-nar-utils-0.6.1.jar:0.6.1]
>
> at
> org.apache.nifi.nar.ExtensionManager.discoverExtensions(ExtensionManager.java:88)
> ~[nifi-nar-utils-0.6.1.jar:0.6.1]
>
> at org.apache.nifi.NiFi.(NiFi.java:120)
> ~[nifi-runtime-0.6.1.jar:0.6.1]
>
> at org.apache.nifi.NiFi.main(NiFi.java:227)
> ~[nifi-runtime-0.6.1.jar:0.6.1]
>
> Caused by: java.lang.NoClassDefFoundError:
> org/apache/nifi/ssl/SSLContextService
>
> at
> nifi.processors.http.looper.InvokeHTTPLooper.(InvokeHTTPLooper.java:200)
> ~[nifi-http.looper-processors-1.0.jar:1.0]
>
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> ~[na:1.8.0_77]
>
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> ~[na:1.8.0_77]
>
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> ~[na:1.8.0_77]
>
> at
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> ~[na:1.8.0_77]
>
> at java.lang.Class.newInstance(Class.java:442)
> ~[na:1.8.0_77]
>
> at
> java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
> ~[na:1.8.0_77]
>
> ... 6 common frames omitted
>
> Caused by: java.lang.ClassNotFoundException:
> org.apache.nifi.ssl.SSLContextService
>
> at
> java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[na:1.8.0_77]
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> ~[na:1.8.0_77]
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ~[na:1.8.0_77]
>
>
>
> Thanks
>
> Kumiko
>


Re: Replace Text

2016-06-14 Thread Bryan Bende
What is the error you are getting from TransformXML?

On Tue, Jun 14, 2016 at 9:38 AM, Anuj Handa  wrote:

> anybody has any thoughts on UTF 8 Flow files with XMLtransforemation and
> other processors ?
>
> Anuj
>
> On Mon, Jun 13, 2016 at 4:45 PM, Anuj Handa  wrote:
>
>> So it seems like its a UTF-8 issue, when i changed the string to use Hex
>> instead of Text and using the HEXcode with 00 (2 BYte) the contentsplit
>> worked.
>>
>> > translates into following Hex code
>>
>> *3c0050004f0053005400720061006e00730061006300740069006f006e00200078006d006c006e007300*
>>
>> the transformXML is now failing i think because of the UTF-8. I know i
>> had it working in normal ascii file.
>>
>> Do i need to specify someplace the flow files are UTF-8 or is it smart
>> enough to figure it out on its own ?
>> based on some reading i see that some processors expect UTF-8 so the next
>> question would be do all processors support UTF 8 ?
>>
>> Anuj
>>
>>
>>
>> On Mon, Jun 13, 2016 at 3:01 PM, Anuj Handa  wrote:
>>
>>> thanks Joe, unfortunately since my xml has namespaces (xmlns )  that
>>> approach wont work.
>>> any thought on why spilt doesn't work using the tag, does it accept UTF8
>>> flow files ?
>>>
>>> Anuj
>>>
>>> On Mon, Jun 13, 2016 at 2:50 PM, ski n  wrote:
>>>
 You can also make your input XML well-formed by creating a custom root
 element (e.g. ...xmldocuments
  and then use the SplitXML processor (or just the transformation step).

 2016-06-13 18:04 GMT+02:00 Anuj Handa :

> i have a text file which has multiple XML documents. which starts with 
>  xmlns
> i am trying to break each one of the XML docs into 1 flow-file so i
> can then use evaluate XML and then convert into JSOn and then load into a
> database.
>
> i tried just the split content and that didnt work. the file is UTF 8
> not sure if that plays into it. and i am running the nifi on linux and the
> file is also local on linux.
>
> [image: Inline image 1]
>
> this is my entire workflow.
>
> [image: Inline image 2]
>
>
> On Mon, Jun 13, 2016 at 11:43 AM, Joe Percivall <
> joeperciv...@yahoo.com> wrote:
>
>> Awesome, and what processor were you planning to use to split on
>> "#|#|#"? The SplitContent processor[1] can be used to split the content 
>> on
>> a sequence of text characters which could split on "> xmlns"
>> without needing to add "#|#|#".
>>
>> Also I see "xmlns" and think this is an xml file you are trying to
>> split. If so are you by chance trying to split evenly on each child? If 
>> so
>> the "SplitXml" processor[2] would easily take care of that.
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitContent/index.html
>> [2]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.SplitXml/index.html
>>
>> Joe- - - - - -
>> Joseph Percivall
>> linkedin.com/in/Percivall
>> e: joeperciv...@yahoo.com
>>
>>
>>
>>
>> On Monday, June 13, 2016 11:26 AM, Anuj Handa 
>> wrote:
>> Yes that's exactly correct.
>>
>>
>> > On Jun 13, 2016, at 11:14 AM, Joe Percivall 
>> wrote:
>> >
>> > Sorry I got a bit confused, in your original question you said that
>> you wanted to append the value and I took it that you just wanted to 
>> append
>> the value to the end of the line or text.
>> >
>> > Let me try and restate your goal so I'm sure I understand,
>> ultimately you want to split the incoming FlowFile on each occurrence of
>> "> "#|#|#" before each occurrence so that it will be easy to split?
>> >
>> >
>> > Joe
>> > - - - - - -
>> > Joseph Percivall
>> > linkedin.com/in/Percivall
>> > e: joeperciv...@yahoo.com
>> >
>> >
>> >
>> > On Monday, June 13, 2016 11:05 AM, Anuj Handa 
>> wrote:
>> >
>> >
>> >
>> > Anuj
>> > Hi Joe,
>> >
>> > I modified the process per your suggestion but it only works to
>> replace the first occurrence, There are multiple such tags which it 
>> doesn't
>> replace. .
>> > when i used evaluation mode line by line it appended it to every
>> line in the file and not to the one i waned too.
>> >
>> >
>> >
>> >
>> > On Mon, Jun 13, 2016 at 10:40 AM, Joe Percivall <
>> joeperciv...@yahoo.com> wrote:
>> >
>> > Hello,
>> >>
>> >> In order to use ReplaceText[1] to solely append a value to the end
>> of then entire text then change the "Replacement Strategy" to "Append" 
>> and
>> leave "Evaluation Mode" as "Entire  Text". This will take whatever is the
>> 

Re: NoClassDefFoundError: org/slf4j/LoggerFactory

2016-06-13 Thread Bryan Bende
Hi Donald,

I know this does not directly address the conflict in dependencies, but I
wanted to mention that it is not required to inherit from nifi-nar-bundles.

The Maven archetype does that by default, but you can certainly remove it;
there are some instructions on how to do so here:
https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions#MavenProjectsforExtensions-Inheritance

Once you add the build section with the NAR plugin, then you can switch the
packaging back to nar.
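
Roughly, the build section the wiki describes looks like the sketch below
(check the wiki for the plugin version that matches your NiFi release):

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.nifi</groupId>
            <artifactId>nifi-nar-maven-plugin</artifactId>
            <version>1.1.0</version>
            <extensions>true</extensions>
        </plugin>
    </plugins>
</build>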

-Bryan

On Mon, Jun 13, 2016 at 10:12 AM, Dr. Donald Leonhard-MacDonald <
don...@lmdventures.de> wrote:

> Hi Oleg,
>
> Thank you for your quick response. Unfortunately it’s still not working.
>
> I had tried these but had left them out by mistake when I’d tried mutation
> coding to get it all running. :-)
>
> I have added them in now so it can be tested. I’m still getting the same
> error.
>
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
> at NifiLogging.processors.LoggingProcessor.App.main(App.java:12)
> ... 6 more
> Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 7 more
>
>
> I think it’s a problem with a conflict in
>
> 
> org.apache.nifi
> nifi-nar-bundles
> 0.6.1
> 
> 
>
> When I remove it and change
>
> nifi-LoggingProcessor-nar
> 1.0
> nar
>
> To
>
> nifi-LoggingProcessor-nar
> 1.0
> jar
>
> it works.
>
> Any ideas?
>
> Cheers,
>
> Donald
>
> On 13 Jun 2016, at 14:01, Oleg Zhurakousky 
> wrote:
>
> Donald
>
> What I see is that SLF4J is nowhere in your claspath, so by adding it to
> your pom should solve the issue. Also don’t forget to add specific bindings
> (e.g., log4j) and its implementation.
> Basically here is the example of what you need to have if you want to use
> Log4j:
>
> 
> org.slf4j
> slf4j-api
> 1.7.21
> 
> 
> org.slf4j
> slf4j-log4j12
> 1.7.21
> 
> 
> log4j
> log4j
> 1.2.17
> 
>
> Cheers
> Oleg
>
> On Jun 13, 2016, at 7:22 AM, Dr. Donald Leonhard-MacDonald <
> don...@lmdventures.de> wrote:
>
> Hi All,
>
> First let me say I’m really enjoying using Nifi, it’s a great project.
>
> I have been creating some Processors and they are working very well. Now I
> want to include processors that talk to RethinkDB.
> When I include the dependency in Maven I get an
> java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory Exception.
>
> I have been using the getLogger in the processors and that works really
> well.
>
> I thought it might be a versioning issue(as suggested on website), so
> tried different versions but with no success. Though I can
> use the log4j when I restrict the scope of the jar to test.
>
> So I created several small test frameworks that only imported Log4J and
> nothing else to try and see what I was doing wrong.
> As soon as I set the parent of my pom to nifi
>
> Here is a very small example that creates the error
>
> https://github.com/macd-ci0/nifilogging
>
> mvn compile
> mvn exec:java
>
> A
> [WARNING]
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:293)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
> at NifiLogging.processors.LoggingProcessor.App.main(App.java:12)
> ... 6 more
> Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 7 more
>
>
> When I remove the parent element:
> 
> NifiExcel
> ExcelProcessors
> 1.0
> 
> and change the target type from nar->jar in the pom.xml file in the -nar
> directory It works well.
> So the conflict is here.
>
> I get the same conflict. When I tried to create a copy similar to the pom
> for the nifi-kafka-bundle/nifi-kafka-processors which
> also uses Log4J,  I get the same problem.
>
> I have also tried logback, 

Re: Writing files to MapR File system using putHDFS

2016-06-13 Thread Bryan Bende
I'm not sure if this would make a difference, but typically the
configuration resources would be the full paths to core-site.xml and
hdfs-site.xml. Wondering if using those instead of yarn-site.xml changes
anything.

On Monday, June 13, 2016, Ravi Papisetti (rpapiset) 
wrote:

> Yes, Aldrin. Tried listHDFS, gets the similar error complaining directory
> doesn't exist.
>
> NiFi – 0.6.1
> MapR – 5.1
>
> NiFi is local standalone instance. Target cluster is enabled with token
> based authentication. I am able to execute "hadoop fs –ls " from cli
> on the node with NiFi installed.
>
>
> Thanks,
>
> Ravi Papisetti
>
> Technical Leader
>
> Services Technology Incubation Center
> 
>
> rpapi...@cisco.com 
>
> Phone: +1 512 340 3377
>
>
> [image: stic-logo-email-blue]
>
> From: Aldrin Piri  >
> Reply-To: "users@nifi.apache.org
> " <
> users@nifi.apache.org
> >
> Date: Monday, June 13, 2016 at 6:24 PM
> To: "users@nifi.apache.org
> " <
> users@nifi.apache.org
> >
> Subject: Re: Writing files to MapR File system using putHDFS
>
> Hi Ravi,
>
> Could you provide some additional details in terms of both your NiFi
> environment and the MapR destination?
>
> Is your NiFi a single instance or clustered?  In the case of the latter,
> is security established for your ZooKeeper ensemble?
>
> Is your target cluster Kerberized?  What version are you running?  Have
> you attempted to use the List/GetHDFS processors?  Do they also have errors
> in reading?
>
> Thanks!
> --aldrin
>
> On Mon, Jun 13, 2016 at 5:19 PM, Ravi Papisetti (rpapiset) <
> rpapi...@cisco.com >
> wrote:
>
>> Thanks Conrad for your reply.
>>
>> Yes, I have configured putHDFS with "Remove Owner" and "Renive Group"
>> with same values as on HDFS. Also, nifi service is started under the same
>> user.
>>
>>
>>
>> Thanks,
>>
>> Ravi Papisetti
>>
>> Technical Leader
>>
>> Services Technology Incubation Center
>> 
>>
>> rpapi...@cisco.com 
>>
>> Phone: +1 512 340 3377
>>
>>
>> [image: stic-logo-email-blue]
>>
>> From: Conrad Crampton > >
>> Reply-To: "users@nifi.apache.org
>> " <
>> users@nifi.apache.org
>> >
>> Date: Monday, June 13, 2016 at 4:01 PM
>> To: "users@nifi.apache.org
>> " <
>> users@nifi.apache.org
>> >
>> Subject: Re: Writing files to MapR File system using putHDFS
>>
>> Hi,
>>
>> Sounds like a permissions problem. Have you set the Remote Owner and
>> Remote Groups settings in the processor appropriate for HDFS permissions?
>>
>> Conrad
>>
>>
>>
>> *From: *"Ravi Papisetti (rpapiset)" > >
>> *Reply-To: *"users@nifi.apache.org
>> " <
>> users@nifi.apache.org
>> >
>> *Date: *Monday, 13 June 2016 at 21:25
>> *To: *"users@nifi.apache.org
>> " <
>> users@nifi.apache.org
>> >, "
>> d...@nifi.apache.org "
>> > >
>> *Subject: *Writing files to MapR File system using putHDFS
>>
>>
>>
>> Hi,
>>
>>
>>
>> We just started exploring apache nifi for data onboarding into MapR
>> distribution. Have configured putHDFS with yarn-site.xml from on local mapr
>> client where cluster information is provided, configured the "Directory"
>> with mapr fs directory to write the files, configured nifi to run as user
>> has permission to write to mapr fs, inspie of that we are getting below
>> error while writing the file into given file system path. I am doubting,
>> nifi is not talking to the cluster or talking with wrong user, appreciate
>> if you some can guide me to troubleshoot this issue or any solutions if we
>> are doing something wrong:
>>
>> Nifi workflow is very simple: GetFile is configure to read from locla
>> file system, connected to PutHDFS with yarn-site.xml and directory
>> information configured.
>>
>> 2016-06-13 15:14:36,305 INFO [Timer-Driven Process Thread-2]
>> 

Re: How to configure site-to-site communication between nodes in one cluster.

2016-06-02 Thread Bryan Bende
It really comes down to what works best for your use case

NiFi is not made to compete with distributed computation frameworks like
Spark, Map-Reduce, etc, its job is to bring data to them. So if you need to
run a computation across 100s-1000s of nodes, then you would do that in
Hadoop. NiFi clusters are usually around 10 nodes or less.

For ETL, NiFi can tackle some use cases, but again, there are situations
where something like Sqoop is going to be a better choice because it's
specifically engineered for massive extraction from a database.

All that being said, it sounds like you haven't had a problem getting the
data out of the database with NiFi. Is the transform part of your flow
taking longer than you expected? Can you share more about what you are
doing to each record, and how many records there are?

-Bryan



On Thu, Jun 2, 2016 at 5:19 AM, Yuri Nikonovich <utagai...@gmail.com> wrote:

> Hi
> Thank you, Bryan.
> I've built my pipeline like you've described with RPG to process splitted
> parts. The thing that concerns me is the approach to clustering with each
> node running complete flow separately from other nodes. This approach makes
> me think that Nifi isn't suited for heavy ETL processes running within its
> processors. Maybe it is better to use Nifi flow as an orchestration tool
> and do heavy work (like validation or transformation) with other tools
> (like Hadoop for example). For example Fetch data from DB -> SplitIntoAvro
> -> Send it to validation/transformation Hadoop Job -> get results back to
> Nifi -> do other things. What do you think of this approach?
>
> 2016-06-01 21:24 GMT+03:00 Bryan Bende <bbe...@gmail.com>:
>
>> NiFi is definitely suitable for processing large files, but NiFi's
>> clustering model works a little different than some of the distributed
>> processing frameworks people are used to.
>> In a NiFi cluster, each node runs the same flow/graph, and it is the data
>> that needs to be partitioned across the nodes. How to partition the data
>> really depends on the use-case (that is what the article I linked to was
>> all about).
>>
>> In your scenario there are a couple of ways to achieve parallelism...
>>
>> Process everything on the node that the HTTP requests comes in on, and
>> increase the Concurrent Tasks (# of threads) for the processors after
>> SplitAvro so that multiple chunks can be transformed and send to Cassandra
>> in parallel.
>> I am assuming the HTTP requests are infrequent and are acting as a
>> trigger for the process, but if they are frequent you could put a load
>> balancer in front of NiFi to distribute those requests across the nodes.
>>
>> The other option is to use the RPG redistribution technique to
>> redistribute the chunks across the cluster, can still adjust the Concurrent
>> Tasks on the processors to have each node doing more in parallel.
>> You would put SplitAvro -> RPG that points to itself, then somewhere else
>> on the flow there is an Input Port -> follow on processors, the RPG
>> connects to that Input Port.
>> The receive HTTP request would be set to run on Primary Node only.
>>
>> It will come down to which is faster... processing the chunks locally on
>> one node with multiple threads, or transferring the chunks across the
>> network and processing them on multiple nodes with multiple threads.
>>
>> On Wed, Jun 1, 2016 at 12:37 PM, Yuri Nikonovich <utagai...@gmail.com>
>> wrote:
>>
>>> Hello, Bryan
>>> Thanks for the answer.
>>> You've understood me correctly. What I'm trying to achieve is to put
>>> some validation on the dataset. So I fetch all data with one query from
>>> db(I can't change this behavior), then I use SplitAvro processor to split
>>> it into chunks say 1000 records each. After that I want to treat each chunk
>>> independently, transform each record in a chunk according to my domain
>>> model, validate it and save. This transform-load work I want to distribute
>>> across the cluster.
>>>
>>> While reading about Nifi I've haven't found any information about flows
>>> like mine. This fact worries me a little. Maybe I'm trying to do something
>>> that is not suitable for Nifi.
>>>
>>> Is Nifi a suitable tool for processing large files or I should not do
>>> actual processing work outside the Nifi flow?
>>>
>>> 2016-06-01 17:28 GMT+03:00 Bryan Bende <bbe...@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> This post [1] has a description of how to redistribute data with in the
>>>> same cluster. You are correct that it involves a RPG pointi

Re: Spark or custom processor?

2016-06-02 Thread Bryan Bende
Conrad,

I would think that you could do this all in NiFi.

How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP,
List+FetchFile?

-Bryan


On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton  wrote:

> Hi,
>
> Any advice on ‘best’ architectural approach whereby some processing
> function has to be applied to every flow file in a dataflow with some
> (possible) output based on flowfile content.
>
> e.g. inspect log files for specific ip then send message to syslog
>
>
>
> approach 1
>
> Spark
>
> Output port from NiFi -> Spark listens to that stream -> processes and
> outputs accordingly
>
> Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
>
> Disadvantages – adds complexity, decoupled from NiFi.
>
>
>
> Approach 2
>
> NiFi
>
> Custom processor -> PutSyslog
>
> Advantages – reuse existing NiFi processors/ capability, obvious flow
> (design intent)
>
> Disadvantages – scale??
>
>
>
> Any comments/ advice/ experience of either approaches?
>
>
>
> Thanks
>
> Conrad
>
>
>
>
>
>
> SecureData, combating cyber threats
>
> --
>
> The information contained in this message or any of its attachments may be
> privileged and confidential and intended for the exclusive use of the
> intended recipient. If you are not the intended recipient any disclosure,
> reproduction, distribution or other dissemination or use of this
> communications is strictly prohibited. The views expressed in this email
> are those of the individual and not necessarily of SecureData Europe Ltd.
> Any prices quoted are only valid if followed up by a formal written quote.
>
> SecureData Europe Limited. Registered in England & Wales 04365896.
> Registered Address: SecureData House, Hermitage Court, Hermitage Lane,
> Maidstone, Kent, ME16 9NT
>


Re: How to access the counter and provenance Info

2016-06-02 Thread Bryan Bende
Hi Kumiko,

I think you can only increment counters and report provenance events from
processors, but not query counters or query provenance.

If you are in a custom processor, you could have an AtomicInteger counter that
increments on every onTrigger and resets when 24 hours have passed?
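
Something along these lines is what I had in mind. This is just a rough,
untested sketch (the names are placeholders, and if you allow more than one
concurrent task you would want to guard the reset):

// assumes a processor extending AbstractProcessor
// imports: java.util.concurrent.TimeUnit, java.util.concurrent.atomic.AtomicLong
private final AtomicLong runCount = new AtomicLong(0);
private volatile long windowStartMillis = System.currentTimeMillis();

@Override
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
    final long now = System.currentTimeMillis();
    if (now - windowStartMillis >= TimeUnit.HOURS.toMillis(24)) {
        // start a new 24 hour window
        runCount.set(0);
        windowStartMillis = now;
    }
    final long runsInCurrentWindow = runCount.incrementAndGet();
    // ... rest of the processor logic can use runsInCurrentWindow ...
}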

Thanks,

Bryan


On Thu, Jun 2, 2016 at 10:50 AM, Kumiko Yada  wrote:

> Could anyone help me on this?
>
> Thanks
> Kumiko
> --
> *From:* Kumiko Yada 
> *Sent:* Wednesday, June 1, 2016 4:22:25 PM
> *To:* users@nifi.apache.org
> *Subject:* RE: How to access the counter and provenance Info
>
>
> When I clicked the Data Provenance from context menu of processor, the
> item history are showed the NiFi Flow Data Provenance UI.  How can I get
> this item list from the custom processor?
>
>
>
> Thanks
>
> Kumiko
>
>
>
> *From:* Kumiko Yada [mailto:kumiko.y...@ds-iq.com]
> *Sent:* Wednesday, June 1, 2016 12:56 PM
> *To:* users@nifi.apache.org
> *Subject:* How to access the counter and provenance Info
>
>
>
> Hello,
>
>
>
> I’d like to get how many times the custom processor has run in the past 24
> hours, from the onTrigger method.  How can I get this using counters or provenance info?
>
>
>
> Thanks
>
> Kumiko
>


Re: Spark or custom processor?

2016-06-02 Thread Bryan Bende
In addition to what Mark said, regarding getting the logs into NiFi...

I've found that when syslog servers forward messages over TCP, they
typically open a single connection, so in this case sending to the primary
node probably makes sense.

If you are forwarding over UDP, you might be able to stick a UDP load
balancer in front of NiFi so that the logs are being routed to the
ListenSyslog on each node, and then you wouldn't need the RPG.

Just something to think about.

On Thu, Jun 2, 2016 at 9:42 AM, Mark Payne <marka...@hotmail.com> wrote:

> Conrad,
>
> Typically, the way that we like to think about using NiFi vs. something
> like Spark or Storm is whether
> the processing is Simple Event Processing or Complex Event Processing.
> Simple Event Processing
> encapsulates those tasks where you are able to operate on a single piece
> of data by itself (or in correlation
> with an Enrichment Dataset). So tasks like enrichment, splitting, and
> transformation are squarely within
> the wheelhouse of NiFi.
>
> When we talk about doing Complex Event Processing, we are generally
> talking about either processing data
> from multiple streams together (think JOIN operations) or analyzing data
> across time windows (think calculating
> norms, standard deviation, etc. over the last 30 minutes). The idea here
> is to derive a single new "insight" from
> windows of data or joined streams of data - not to transform or enrich
> individual pieces of data. For this, we would
> recommend something like Spark, Storm, Flink, etc.
>
> In terms of scalability, NiFi certainly was not designed to scale outward
> in the way that Spark was. With Spark you
> may be scaling to thousands of nodes, but with NiFi you would get a pretty
> poor user experience because each change
> in the UI must be replicated to all of those nodes. That being said, NiFi
> does scale up very well to take full advantage
> of however much CPU and disks you have available. We typically see
> processing of several terabytes of data per day
> on a single node, so we have generally not needed to scale out to hundreds
> or thousands of nodes.
>
> I hope this helps to clarify when/where to use each one. If there are
> things that are still unclear or if you have more
> questions, as always, don't hesitate to shoot another email!
>
> Thanks
> -Mark
>
>
> On Jun 2, 2016, at 9:28 AM, Conrad Crampton <conrad.cramp...@secdata.com>
> wrote:
>
> Hi,
> ListenSyslog (using the approach that is being discussed currently in
> another thread – ListenSyslog running on primary node as RPG, all other
> nodes connecting to the port that the RPG exposes).
> Various enrichment, routing on attributes etc. and finally into HDFS as
> Avro.
> I want to branch off at an appropriate point in the flow and do some
> further realtime analysis – got the output to port feeding to Spark process
> working fine (notwithstanding the issue that you have been so kind to help
> with previously with the SSLContext), just thinking about if this is most
> appropriate solution.
>
> I have dabbled with a custom processor (for enriching url splitting/
> enriching etc. – probably could have done with ExecuteScript processor in
> hindsight) so am comfortable with going this route if that is deemed more
> appropriate.
>
> Thanks
> Conrad
>
> *From: *Bryan Bende <bbe...@gmail.com>
> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Date: *Thursday, 2 June 2016 at 13:12
> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
> *Subject: *Re: Spark or custom processor?
>
> Conrad,
>
> I would think that you could do this all in NiFi.
>
> How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP,
> List+FetchFile?
>
> -Bryan
>
>
> On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton <
> conrad.cramp...@secdata.com> wrote:
>
> Hi,
> Any advice on ‘best’ architectural approach whereby some processing
> function has to be applied to every flow file in a dataflow with some
> (possible) output based on flowfile content.
> e.g. inspect log files for specific ip then send message to syslog
>
> approach 1
> Spark
> Output port from NiFi -> Spark listens to that stream -> processes and
> outputs accordingly
> Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
> Disadvantages – adds complexity, decoupled from NiFi.
>
> Approach 2
> NiFi
> Custom processor -> PutSyslog
> Advantages – reuse existing NiFi processors/ capability, obvious flow
> (design intent)
> Disadvantages – scale??
>
> Any comments/ advice/ experience of either approaches?
>
> Thanks
> Conrad
>
>
>
>

Re: Spark or custom processor?

2016-06-02 Thread Bryan Bende
>> Typically, the way that we like to think about using NiFi vs. something
>> like Spark or Storm is whether the processing is Simple Event Processing
>> or Complex Event Processing.
>> Simple Event Processing
>>
>> encapsulates those tasks where you are able to operate on a single piece
>> of data by itself (or in correlation
>>
>> with an Enrichment Dataset). So tasks like enrichment, splitting, and
>> transformation are squarely within
>>
>> the wheelhouse of NiFi.
>>
>>
>>
>> When we talk about doing Complex Event Processing, we are generally
>> talking about either processing data
>>
>> from multiple streams together (think JOIN operations) or analyzing data
>> across time windows (think calculating
>>
>> norms, standard deviation, etc. over the last 30 minutes). The idea here
>> is to derive a single new "insight" from
>>
>> windows of data or joined streams of data - not to transform or enrich
>> individual pieces of data. For this, we would
>>
>> recommend something like Spark, Storm, Flink, etc.
>>
>>
>>
>> In terms of scalability, NiFi certainly was not designed to scale outward
>> in the way that Spark was. With Spark you
>>
>> may be scaling to thousands of nodes, but with NiFi you would get a
>> pretty poor user experience because each change
>>
>> in the UI must be replicated to all of those nodes. That being said, NiFi
>> does scale up very well to take full advantage
>>
>> of however much CPU and disks you have available. We typically see
>> processing of several terabytes of data per day
>>
>> on a single node, so we have generally not needed to scale out to
>> hundreds or thousands of nodes.
>>
>>
>>
>> I hope this helps to clarify when/where to use each one. If there are
>> things that are still unclear or if you have more
>>
>> questions, as always, don't hesitate to shoot another email!
>>
>>
>>
>> Thanks
>>
>> -Mark
>>
>>
>>
>>
>>
>> On Jun 2, 2016, at 9:28 AM, Conrad Crampton <conrad.cramp...@secdata.com>
>> wrote:
>>
>>
>>
>> Hi,
>>
>> ListenSyslog (using the approach that is being discussed currently in
>> another thread – ListenSyslog running on primary node as RGP, all other
>> nodes connecting to the port that the RPG exposes).
>>
>> Various enrichment, routing on attributes etc. and finally into HDFS as
>> Avro.
>>
>> I want to branch off at an appropriate point in the flow and do some
>> further realtime analysis – got the output to port feeding to Spark process
>> working fine (notwithstanding the issue that you have been so kind to help
>> with previously with the SSLContext), just thinking about if this is most
>> appropriate solution.
>>
>>
>>
>> I have dabbled with a custom processor (for enriching url splitting/
>> enriching etc. – probably could have done with ExecuteScript processor in
>> hindsight) so am comfortable with going this route if that is deemed more
>> appropriate.
>>
>>
>>
>> Thanks
>>
>> Conrad
>>
>>
>>
>> *From: *Bryan Bende <bbe...@gmail.com>
>> *Reply-To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Date: *Thursday, 2 June 2016 at 13:12
>> *To: *"users@nifi.apache.org" <users@nifi.apache.org>
>> *Subject: *Re: Spark or custom processor?
>>
>>
>>
>> Conrad,
>>
>>
>>
>> I would think that you could do this all in NiFi.
>>
>>
>>
>> How do the log files come into NiFi? TailFile, ListenUDP/ListenTCP,
>> List+FetchFile?
>>
>>
>>
>> -Bryan
>>
>>
>>
>>
>>
>> On Thu, Jun 2, 2016 at 6:41 AM, Conrad Crampton <
>> conrad.cramp...@secdata.com> wrote:
>>
>> Hi,
>>
>> Any advice on ‘best’ architectural approach whereby some processing
>> function has to be applied to every flow file in a dataflow with some
>> (possible) output based on flowfile content.
>>
>> e.g. inspect log files for specific ip then send message to syslog
>>
>>
>>
>> approach 1
>>
>> Spark
>>
>> Output port from NiFi -> Spark listens to that stream -> processes and
>> outputs accordingly
>>
>> Advantages – scale spark job on Yarn, decoupled (reusable) from NiFi
>>
>> Disadvantages – adds complexity, decoupled from NiFi.
>>
>>
>>
>> Approach 2
>>
>> NiFi
>>
>> Custom processor -> PutSyslog
>>
>> Advantages – reuse existing NiFi processors/ capability, obvious flow
>> (design intent)
>>
>> Disadvantages – scale??
>>
>>
>>
>> Any comments/ advice/ experience of either approaches?
>>
>>
>>
>> Thanks
>>
>> Conrad
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>


Re: How to configure site-to-site communication between nodes in one cluster.

2016-06-01 Thread Bryan Bende
NiFi is definitely suitable for processing large files, but NiFi's
clustering model works a little different than some of the distributed
processing frameworks people are used to.
In a NiFi cluster, each node runs the same flow/graph, and it is the data
that needs to be partitioned across the nodes. How to partition the data
really depends on the use-case (that is what the article I linked to was
all about).

In your scenario there are a couple of ways to achieve parallelism...

Process everything on the node that the HTTP requests come in on, and
increase the Concurrent Tasks (# of threads) for the processors after
SplitAvro so that multiple chunks can be transformed and sent to Cassandra
in parallel.
I am assuming the HTTP requests are infrequent and are acting as a trigger
for the process, but if they are frequent you could put a load balancer in
front of NiFi to distribute those requests across the nodes.

The other option is to use the RPG redistribution technique to redistribute
the chunks across the cluster; you can still adjust the Concurrent Tasks on the
processors to have each node doing more in parallel.
You would put SplitAvro -> RPG that points to itself, then somewhere else
on the flow there is an Input Port -> follow on processors, the RPG
connects to that Input Port.
The receive HTTP request would be set to run on Primary Node only.

It will come down to which is faster... processing the chunks locally on
one node with multiple threads, or transferring the chunks across the
network and processing them on multiple nodes with multiple threads.

On Wed, Jun 1, 2016 at 12:37 PM, Yuri Nikonovich <utagai...@gmail.com>
wrote:

> Hello, Bryan
> Thanks for the answer.
> You've understood me correctly. What I'm trying to achieve is to put some
> validation on the dataset. So I fetch all data with one query from db(I
> can't change this behavior), then I use SplitAvro processor to split it
> into chunks say 1000 records each. After that I want to treat each chunk
> independently, transform each record in a chunk according to my domain
> model, validate it and save. This transform-load work I want to distribute
> across the cluster.
>
> While reading about Nifi I haven't found any information about flows
> like mine. This fact worries me a little. Maybe I'm trying to do something
> that is not suitable for Nifi.
>
> Is Nifi a suitable tool for processing large files, or should I not do
> actual processing work outside the Nifi flow?
>
> 2016-06-01 17:28 GMT+03:00 Bryan Bende <bbe...@gmail.com>:
>
>> Hello,
>>
>> This post [1] has a description of how to redistribute data within the
>> same cluster. You are correct that it involves a RPG pointing back to the
>> same cluster.
>>
>> One thing to keep in mind is that typically we do this with a List +
>> Fetch pattern, where the List operation produces lightweight results like
>> the list of filenames to fetch, then redistributes those results and the
>> fetching happens in parallel.
>> In your case, if i understand it correctly, you will have already fetched
>> the data on the first node, and then have to transfer the actual data to
>> the cluster nodes which could have some overhead.
>>
>> It might require a custom processor to do this, but you might want to
>> consider somehow determining what needs to be fetched after receiving the
>> HTTP request, and redistributing that so each node can then fetch from the
>> DB in parallel.
>>
>> Let me know if this doesn't make sense.
>>
>> -Bryan
>>
>> [1]
>> https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
>>
>>
>> On Wed, Jun 1, 2016 at 6:06 AM, Yuri Nikonovich <utagai...@gmail.com>
>> wrote:
>>
>>> Hi
>>> I have the following flow:
>>> Receive HTTP request -> Fetch data from db -> split it in chunks of
>>> fixed size -> process each chunk and save it to Cassandra.
>>>
>>> I've built a flow and it works perfectly on non-clustered setup. But
>>> when I configured clustered setup
>>> I found out that all heavy work is done only on one node. So if the flow
>>> has started on node1 it will run to the end on node1. What I want to
>>> achieve is to spread data chunks fetched from DB across the cluster in
>>> order to process them in parallel, but it looks like Nifi doesn't send flow
>>> files between nodes in a cluster.
>>> As far as I understand, in order to make node send data to another node
>>> I should create a remote process group and send data to this RPG. All
>>> examples I could find on Internet describe RPGs as cluster-to-cluster
>>> communication or remote node-to-cluster communication. So for my case, I
>>> assume, I have to create a RPG pointing to the same cluster. Could you please
>>> provide me a guide how to do this.
>>>
>>>
>>> --
>>> Regards,
>>> Nikanovich Yury
>>>
>>
>>
>
>
> --
> С уважением,
> Юрий Никонович
>


Re: Which processor to use to cleanly convert xml to json?

2016-06-01 Thread Bryan Bende
Hi Keith,

There is currently no built in processor that directly transforms XML to
JSON.

TransformXML leverages XSLT to transform an XML document into some other
format.
In that post, the XSLT happens to transform into JSON, but it looks like
maybe it only handles top-level elements and not nesting.

I would say your options would be to modify that stylesheet to support
nested elements, or if you have a specific well-defined XML format you
could write a custom processor that is specific to your format.
For a custom processor you could possibly generate JAXB objects from your XML
schema, unmarshal the XML into those objects, then re-marshal them as JSON.
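
If generating JAXB classes feels heavyweight, another generic option is to let
Jackson do the conversion in the custom processor. This is only a sketch and
assumes you are willing to add the jackson-dataformat-xml module as a
dependency; repeated elements and attributes don't always map the way you might
expect, so test it against your documents:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

public class XmlToJson {
    public static String convert(final String xml) throws Exception {
        // parse the XML into a generic tree, then write that same tree back out as JSON
        final JsonNode tree = new XmlMapper().readTree(xml);
        return new ObjectMapper().writeValueAsString(tree);
    }
}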

Others may have additional suggestions of something that could be done
through ExecuteScript.

-Bryan


On Wed, Jun 1, 2016 at 12:10 PM, Keith Lim  wrote:

> Any help guidance much appreciated.
>
> Thanks,
> Keith
> --
> From: Keith Lim 
> Sent: ‎5/‎31/‎2016 4:07 PM
> To: users@nifi.apache.org
> Subject: Which processor to use to cleanly convert xml to json?
>
> Which processor should I use to cleanly convert from xml to json?
>
> This article illustrates using TransformXML with a stylesheet to convert
> xml to json.
>
>
> https://community.hortonworks.com/articles/29474/nifi-converting-xml-to-json.html
>
> However, I am seeing that it does not convert values with a special tag such as
> the embedded <BR> tag as below:
>
>
> XML fragment:
>
> 
>
> LakeRiverNational_State Park
>
> A:Value1B:Value2C:Value3D:Value4E:Value5
>
> 
>
>
> Converted Json:
>
>
> { "record" : {
>
> "property1" : { "BR" :["",""] },
> "property2" : { "BR" :["","","",""] }
>
> }}
>
> I may need to readup on stylesheet however, this is just the problem I am
> seeing, and don't know what other issue may crop up using this script.
>
> Thanks,
> Keith
>
>


Re: How to configure site-to-site communication between nodes in one cluster.

2016-06-01 Thread Bryan Bende
Hello,

This post [1] has a description of how to redistribute data within the
same cluster. You are correct that it involves a RPG pointing back to the
same cluster.

One thing to keep in mind is that typically we do this with a List + Fetch
pattern, where the List operation produces lightweight results like the
list of filenames to fetch, then redistributes those results and the
fetching happens in parallel.
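(As a concrete example, that usually looks like: ListHDFS scheduled on the
primary node only -> RPG pointing back at this same cluster -> Input Port ->
FetchHDFS and the downstream processors running on every node.)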
In your case, if I understand it correctly, you will have already fetched
the data on the first node, and then have to transfer the actual data to
the cluster nodes which could have some overhead.

It might require a custom processor to do this, but you might want to
consider somehow determining what needs to be fetched after receiving the
HTTP request, and redistributing that so each node can then fetch from the
DB in parallel.

Let me know if this doesn't make sense.

-Bryan

[1]
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html


On Wed, Jun 1, 2016 at 6:06 AM, Yuri Nikonovich  wrote:

> Hi
> I have the following flow:
> Receive HTTP request -> Fetch data from db -> split it in chunks of fixed
> size -> process each chunk and save it to Cassandra.
>
> I've built a flow and it works perfectly on non-clustered setup. But when
> I configured clustered setup
> I found out that all heavy work is done only on one node. So if the flow
> has started on node1 it will run to the end on node1. What I want to
> achieve is to spread data chunks fetched from DB across the cluster in
> order to process them in parallel, but it looks like Nifi doesn't send flow
> files between nodes in a cluster.
> As far as I understand, in order to make node send data to another node I
> should create a remote process group and send data to this RPG. All
> examples I could find on Internet describe RPGs as cluster-to-cluster
> communication or remote node-to-cluster communication. So for my case, I
> assume, have to create RPG pointing to the same cluster. Could you please
> provide me a guide how to do this.
>
>
> --
> Regards,
> Nikanovich Yury
>


Re: IDE-specific setup

2016-06-21 Thread Bryan Bende
I personally use IntelliJ and it generally does pretty well at
automatically importing everything.

The process to import an existing Maven project is described here:
https://www.jetbrains.com/help/idea/2016.1/importing-project-from-maven-model.html

In Step #2 you would select the root directory of the nifi source code that
you cloned from git. After that I think all of the defaults should work.

-Bryan


On Tue, Jun 21, 2016 at 4:47 PM, Kumiko Yada  wrote:

> I downloaded the source code by git clone
> http://git-wip-us.apache.org/repos/asf/nifi.git.  Is it possible to import
> all aggregate projects in Eclipse?  When I imported the Maven projects, I
> have been getting many errors. How is everyone setting up the Nifi
> source code?
>
>
>
> I also imported my custom processor project in Eclipse, but I can’t get
> IntelliSense to work; I get the error “This compilation unit is
> not on the build path of a Java project”.  I’m a beginner level on Java, so
> I have limited knowledge of Java IDEs.
>
>
>
> Thanks
>
> Kumiko
>
>
>
> *From:* Andy LoPresto [mailto:alopre...@apache.org]
> *Sent:* Tuesday, June 21, 2016 10:10 AM
>
> *To:* users@nifi.apache.org
> *Subject:* Re: IDE-specific setup
>
>
>
> For more developer-focused information, you should also consider joining
> the dev mailing list [1].
>
>
>
> [1] https://nifi.apache.org/mailing_lists.html
>
>
>
> Andy LoPresto
>
> alopre...@apache.org
>
> *alopresto.apa...@gmail.com *
>
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
>
>
> On Jun 21, 2016, at 10:06 AM, Kumiko Yada  wrote:
>
>
>
> Hi Andy,
>
>
>
> I read several emails from the user mailing list archive that someone was
> going to write up the how to setup the IDE.  However, I didn’t find any
> documents, so I was just wondering.
>
>
>
> Thanks
>
> Kumiko
>
>
>
> *From:* Andy LoPresto [mailto:alopre...@apache.org ]
>
> *Sent:* Tuesday, June 21, 2016 10:03 AM
> *To:* users@nifi.apache.org
> *Subject:* Re: IDE-specific setup
>
>
>
> Kumiko,
>
>
>
> The Contributor Guide [1] has some information about configuring the NiFi
> checkstyle rules for IntelliJ IDEA and Eclipse. Are you looking for
> something specific?
>
>
>
> [1]
> https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-CodeStyle
>
>
>
> Andy LoPresto
>
> alopre...@apache.org
>
> *alopresto.apa...@gmail.com *
>
> PGP Fingerprint: 70EC B3E5 98A6 5A3F D3C4  BACE 3C6E F65B 2F7D EF69
>
>
>
> On Jun 21, 2016, at 9:57 AM, Kumiko Yada  wrote:
>
>
>
> Hello,
>
>
>
> Is there any documentations on how to setup the Nifi with IDE-specific
> setup?
>
>
>
> Thanks
>
> Kumiko
>
>
>


Re: Scheduling using CRON driven on Windows OS

2016-06-16 Thread Bryan Bende
Also, the user guide has a description of the scheduling strategies which
describes the cron format:

https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#scheduling-tab
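
For reference, the fields in that cron format are: seconds, minutes, hours, day
of month, month, day of week, and an optional year. So "00 02 10 * * ?" fires
once a day at 10:02:00, while "02 10 * * * ?" fires at 10 minutes and 2 seconds
past every hour, which lines up with the behavior you described.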

On Thu, Jun 16, 2016 at 1:17 PM, Pierre Villard  wrote:

> Hi Keith,
>
> This is the expected behavior, the first parameter is indeed seconds so
> that */5 * * * * ? will generate a FF every 5 seconds.
> In your case, I believe you'd like something like 00 02 10 * * ?
>
> Hope this helps.
>
> 2016-06-16 19:13 GMT+02:00 Keith Lim :
>
>> My GenerateFlowFile processor is enabled and started
>>
>> with
>>
>>
>> Scheduling Strategy: Cron Driven
>>
>> Run Schedule : 02 10 * * * ?
>>
>>
>> This I expect to generate a flow file daily at 10:02 am, but from my
>> limited test, it seems to take the second parameter as minutes, and
>> generate a flowfile hourly at 10 minutes after the hour, e.g. 10:10, 11:10,
>> 12:10, 13:10...
>>
>> Perhaps there is a bug in parsing system date format?
>>
>>
>> Attached is a simple template that I am using for testing.
>>
>>
>> Thanks,
>> Keith
>>
>>
>> --
>> *From:* Andrew Grande 
>> *Sent:* Wednesday, June 15, 2016 4:10 PM
>> *To:* users@nifi.apache.org
>> *Subject:* Re: Scheduling using CRON driven on Windows OS
>>
>>
>> Keith,
>>
>> Was your processor running at all times? It has to be started and enabled.
>>
>> I guess sharing the cron expression and maybe a quick screenshot will
>> help next.
>>
>> Andrew
>>
>> On Wed, Jun 15, 2016, 6:58 PM Keith Lim  wrote:
>>
>>> I tried setting Scheduling Strategy property to CRON Driven but does not
>>> seem to work.   Sometimes it would fire when not expected to and others not
>>> fire when expected to.
>>> This is on Windows OS and the processor I tried was GenerateFlowFile.
>>> Is CRON Driven setting not designed to work on Windows OS?
>>>
>>> Thanks,
>>> Keith
>>>
>>>
>>>
>


Re: Merge ListenSyslog events

2016-01-08 Thread Bryan Bende
Hello,

Glad to hear you are getting started using ListenSyslog!

You are definitely running into something that we should consider
supporting. The current implementation treats each new-line as the message
delimiter and places each message on to a queue.

When the processor is triggered, it grabs messages from the queue up to the
"Max Batch Size". So in the default case it grabs a single message from the
queue, which in your case is a single line
from one of the multi-line messages, and produces a FlowFile. When "Max
Batch Size" is set higher to say 100, it grabs up to 100 messages and
produces a FlowFile containing all 100 messages.

The messages in the queue are simultaneously coming from all of the
incoming connections, so this is why you don't see all the lines from one
server in the same order. Imagine the queue having something like:

java-server-1 message1 line1
java-server-2 message1 line1
java-server-1 message1 line2
java-server-3 message1 line1
java-server-2 message1 line2


I would need to dig into that splunk documentation a little more, but I
think you are right that we could possibly expose some kind of message
delimiter pattern on the processor which
would be applied when reading the messages, before they even make it into the
queue, so that by the time it gets put in the queue it would be all of the
lines from one message.

Given the current situation, there might be one other option for you. Are
you able to control/change the logback/log4j configuration for the servers
sending the logs?

If so, a JSON layout might solve the problem. These configuration files
show how to do that:
https://github.com/bbende/jsonevent-producer/tree/master/src/main/resources

I know this worked well with the ListenUDP processor to ensure that an
entire stack trace was sent as a single JSON document, but I have not had a
chance to try it with ListenSyslog and the SyslogAppender.
If you are using ListenSyslog with TCP, then it will probably come down to
whether logback/log4j puts new-lines inside the JSON document, or only a
single new-line at the end.

-Bryan


On Fri, Jan 8, 2016 at 11:36 AM, Louis-Étienne Dorval 
wrote:

> Hi everyone!
>
> I'm looking to use the new ListenSyslog processor in a proof-of-concept
> project, but I've encountered a problem that I can't find a suitable solution
> for (yet!).
> I'm receiving logs from multiple Java-based servers using a logback/log4j
> SyslogAppender. The messages are received successfully but when a stack
> trace happens, each line is broken into a separate FlowFile.
>
> I'm trying to achieve something like the following:
> http://docs.splunk.com/Documentation/Splunk/6.2.2/Data/Indexmulti-lineevents
>
> I tried:
> - Increasing the "Max Batch Size", but I end up merging lines that should
> not be merged and there's no way to know the length of the stack trace...
> - Use MergeContent using the host as "Correlation Attribute Name", but as
> before I merge lines that should not be merged
> - Use MergeContent followed by SplitContent, that might work but the
> SplitContent is pretty restrictive and I can't find a "Byte Sequence" that
> is distinct from the stack trace.
>
> Even if I find a magic "Byte Sequence" for my last try (MergeContent +
> SplitContent), I would most probably lose a part of the stacktrace as the
> MergeContent is limited by the "Max Batch Size"
>
>
> The only solution that I see is to modify the ListenSyslog to add some
> similar parameter as the Splunk documentation explains and use that rather
> than a fixed "Max Batch Size".
>
> Am I missing a another option?
> Would that be a suitable feature? (maybe I should ask that question in the
> dev mailing list)
>
> Best regards!
>


Re: Nifi cluster features - Questions

2016-01-10 Thread Bryan Bende
Chakri,

Glad you got site-to-site working.

Regarding the data distribution, I'm not sure why it is behaving that way.
I just did a similar test running ncm, node1, and node2 all on my local
machine, with GenerateFlowFile running every 10 seconds, and Input Port
going to a LogAttribute, and I see it alternating between node1 and node2
logs every 10 seconds.

Is there anything in your primary node logs
(primary_node/logs/nifi-app.log) when you see the data on the other node?

-Bryan


On Sun, Jan 10, 2016 at 3:44 PM, Joe Witt <joe.w...@gmail.com> wrote:

> Chakri,
>
> Would love to hear what you've learned and how that differed from the
> docs themselves.  Site-to-site has proven difficult to setup so we're
> clearly not there yet in having the right operator/admin experience.
>
> Thanks
> Joe
>
> On Sun, Jan 10, 2016 at 3:41 PM, Chakrader Dewaragatla
> <chakrader.dewaraga...@lifelock.com> wrote:
> > I was able to get site-to-site to work.
> > I tried to follow your instructions to distribute data across the
> > nodes.
> >
> > GenerateFlowFile (On Primary) —> RPG
> > RPG —> Input Port   —> Putfile (Time driven scheduling)
> >
> > However, data is only written to one slave (Secondary slave). Primary
> slave
> > has no data.
> >
> > Image screenshot :
> > http://tinyurl.com/jjvjtmq
> >
> > From: Chakrader Dewaragatla <chakrader.dewaraga...@lifelock.com>
> > Date: Sunday, January 10, 2016 at 11:26 AM
> >
> > To: "users@nifi.apache.org" <users@nifi.apache.org>
> > Subject: Re: Nifi cluster features - Questions
> >
> > Bryan – Thanks – I am trying to setup site-to-site.
> > I have two slaves and one NCM.
> >
> > My properties as follows :
> >
> > On both Slaves:
> >
> > nifi.remote.input.socket.port=10880
> > nifi.remote.input.secure=false
> >
> > On NCM:
> > nifi.remote.input.socket.port=10880
> > nifi.remote.input.secure=false
> >
> > When I try to drop a remote process group (with http://:8080/nifi),
> I see
> > error as follows for two nodes.
> >
> > [:8080] - Remote instance is not allowed for Site to Site
> > communication
> > [:8080] - Remote instance is not allowed for Site to Site
> > communication
> >
> > Do you have insight into why it's trying to connect to 8080 on the slaves? When does
> > the 10880 port come into the picture? I remember trying to set up site-to-site a
> few
> > months back and succeeded.
> >
> > Thanks,
> > -Chakri
> >
> >
> >
> > From: Bryan Bende <bbe...@gmail.com>
> > Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
> > Date: Saturday, January 9, 2016 at 11:22 AM
> > To: "users@nifi.apache.org" <users@nifi.apache.org>
> > Subject: Re: Nifi cluster features - Questions
> >
> > The sending node (where the remote process group is) will distribute the
> > data evenly across the two nodes, so an individual file will only be
> sent to
> > one of the nodes. You could think of it as if a separate NiFi instance
> was
> > sending directly to a two node cluster, it would be evenly distributing
> the
> > data across the two nodes. In this case it just so happens to all be
> within
> > the same cluster.
> >
> > The most common use case for this scenario is the List and Fetch
> processors
> > like HDFS. You can perform the listing on primary node, and then
> distribute
> > the results so the fetching takes place on all nodes.
> >
> > On Saturday, January 9, 2016, Chakrader Dewaragatla
> > <chakrader.dewaraga...@lifelock.com> wrote:
> >>
> >> Bryan – Thanks, how do the nodes distribute the load for an input port.
> As
> >> port is open and listening on two nodes,  does it copy same files on
> both
> >> the nodes?
> >> I need to try this setup to see the results, appreciate your help.
> >>
> >> Thanks,
> >> -Chakri
> >>
> >> From: Bryan Bende <bbe...@gmail.com>
> >> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
> >> Date: Friday, January 8, 2016 at 3:44 PM
> >> To: "users@nifi.apache.org" <users@nifi.apache.org>
> >> Subject: Re: Nifi cluster features - Questions
> >>
> >> Hi Chakri,
> >>
> >> I believe the DistributeLoad processor is more for load balancing when
> >> sending to downstream systems. For example, if you had two HTTP
> endpoints,
> >> you could have the first relationship from DistributeLoad going to a

Re: MergeContent: Correlation Attribute Name syntax for matching syslog events

2016-02-07 Thread Bryan Bende
I believe what Joe was referring to with RouteText was that it can take a
regular expression with a capture group, and output a FlowFile per unique
value of the capturing group. So if the incoming data is a FlowFile with a
bunch of syslog messages and you provide a regex that captures hostname, it
can produce a FlowFile per unique hostname with all the messages that go
with that hostname.
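
For example, against RFC3164-style syslog lines a grouping expression roughly
like ^<\d+>\w{3}\s+\d+\s+\S+\s+(\S+)\s.*$ (untested, so adjust it for your
actual message format) would capture the hostname field, and RouteText would
then put all of the lines that share a hostname into the same outgoing FlowFile.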

I don't want to side track the conversation about how to use MergeContent
properly, but wanted to add a couple of things about how ListenSyslog
works...

There is an attribute called "syslog.sender" which is the host that the
message was received from, the value is populated from the incoming
connection in Java code, not from anything in the syslog message. This
should essentially be the host of the syslog server/forwarder.

There is an attribute called "syslog.hostname" which is the hostname in the
syslog message itself, which should be the host that produced that message
and sent it to a syslog server.

By default ListenSyslog has parse set to true and batch size set to 1. If
you set parse to false and increase the batch size to say 100, it will try
to grab a maximum of 100 messages in each execution of the processor (could
be less depending on timing and what is available), and for those 100
messages it groups them by the "sender" (described above) and outputs a
flow file per sender.

Batching can definitely get much higher through put on ListenSyslog, but if
you have to parse them later in the flow with ParseSyslog then you still
need to get each message into its own FlowFile, which most likely entails
SplitText with a line count of 1 and then ParseSyslog. I don't know if this
turns out much better than just letting ListenSyslog parse them in the
first place. If you are letting ListenSyslog do the parsing then you can
increase the concurrent tasks on the processor which means more threads
parsing syslog messages and outputting FlowFiles.

I think the batching concept makes the most sense when you don't need to
parse the messages and just want to deliver the raw messages somewhere like
HDFS, or Kafka.

-Bryan


On Sun, Feb 7, 2016 at 10:03 AM, Andre  wrote:

>
> > You can use RouteText to group (rather than split) on some shared
> pattern such as the hostname.  Will be far more efficient than splitting
> each line then grouping on that hostname.
>
> Not sure I understand?
>
>
>


Re: Send parameters while using getmongo processor

2016-02-04 Thread Bryan Bende
A good example is probably the recently added FetchElasticSearch
processor...

a) See the DOC_ID property with expressionLanguageSupported(true)
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/FetchElasticsearch.java#L87

b) In onTrigger when getting the value for docId - final String docId =
context.getProperty(DOC_ID).evaluateAttributeExpressions(flowFile).getValue();
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/FetchElasticsearch.java#L152

This would assume that if you set the DOC_ID property to expression
language like ${my.id} that something before this processor was setting "
my.id" on each FlowFIle.

This would depend on your use case, but for testing you could set up a flow
with:
 GenerateFlowFile (change the scheduling to every few seconds) ->
UpdateAttribute (set your id attribute) -> FetchMongo

Let us know if that doesn't help.

On Thu, Feb 4, 2016 at 11:28 AM, sudeep mishra <sudeepshekh...@gmail.com>
wrote:

> Hi Bryan,
>
> I am trying to create a processor along the lines of the getmongo processor to
> suit my needs.
>
> Can you please guide me how can I
>
> a) specify a property to support expression language?
> b) assign the content or attribute of a flow file as the query parameter?
>
> Regards,
>
> Sudeep
>
> On Thu, Feb 4, 2016 at 9:11 PM, sudeep mishra <sudeepshekh...@gmail.com>
> wrote:
>
>> Thanks for the feedback Bryan.
>>
>> Yes I need a processor similar to what you described.
>>
>> On Thu, Feb 4, 2016 at 7:38 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hi Sudeep,
>>>
>>> From looking at the GetMongo processor, I don't think this can be done
>>> today. That processor is meant to be a source processor that extracts data
>>> from Mongo using a fixed query.
>>> It seems to me like we would need a FetchMongo processor with a Query
>>> field that supported expression language, and you could set it to
>>> ContractNumber = ${flow.file.contract.number}
>>> Then incoming FlowFiles would have "flow.file.contract.number" as an
>>> attribute and it would fetch documents matching that.
>>>
>>> I don't know that much about MongoDB, but does that sound like what you
>>> need?
>>>
>>> -Bryan
>>>
>>>
>>> On Thu, Feb 4, 2016 at 8:00 AM, sudeep mishra <sudeepshekh...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have following schema of records in MongoDB.
>>>>
>>>> {
>>>> "_id" : ObjectId("56b1958a1ebecc0724588c39"),
>>>> "ContractNumber" : "ABC87gdtr53",
>>>> "DocumentType" : "TestDoc",
>>>> "FlowNr" : 3,
>>>> "TimeStamp" : "03/02/2016 05:51:09:023"
>>>> }
>>>>
>>>> How can I query for a particular contract by 'ContractNumber' using
>>>> getmongo processor? I want to dynamically pass the value of ContractNumber
>>>> and get back the results.
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Sudeep
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards,
>>
>> Sudeep
>>
>
>
>


Re: Send parameters while using getmongo processor

2016-02-04 Thread Bryan Bende
Hi Sudeep,

From looking at the GetMongo processor, I don't think this can be done
today. That processor is meant to be a source processor that extracts data
from Mongo using a fixed query.
It seems to me like we would need a FetchMongo processor with a Query field
that supported expression language, and you could set it to ContractNumber
= ${flow.file.contract.number}
Then incoming FlowFiles would have "flow.file.contract.number" as an
attribute and it would fetch documents matching that.

I don't know that much about MongoDB, but does that sound like what you
need?

-Bryan


On Thu, Feb 4, 2016 at 8:00 AM, sudeep mishra 
wrote:

> Hi,
>
> I have following schema of records in MongoDB.
>
> {
> "_id" : ObjectId("56b1958a1ebecc0724588c39"),
> "ContractNumber" : "ABC87gdtr53",
> "DocumentType" : "TestDoc",
> "FlowNr" : 3,
> "TimeStamp" : "03/02/2016 05:51:09:023"
> }
>
> How can I query for a particular contract by 'ContractNumber' using
> getmongo processor? I want to dynamically pass the value of ContractNumber
> and get back the results.
>
>
> Thanks & Regards,
>
> Sudeep
>


Re: Log4j/logback parser via syslog

2016-02-12 Thread Bryan Bende
I believe groovy, python, jython, jruby, ruby, javascript, and lua.

The associated JIRA is here:
https://issues.apache.org/jira/browse/NIFI-210

There are some cool blogs about them here:
http://funnifi.blogspot.com/2016/02/executescript-processor-hello-world.html

-Bryan

On Fri, Feb 12, 2016 at 10:48 AM, Madhukar Thota <madhukar.th...@gmail.com>
wrote:

> Thanks Bryan. I will look into ExtractText processor.
>
> Do you know what scripting languages are supported with new processors?
>
> -Madhu
>
> On Fri, Feb 12, 2016 at 9:27 AM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hello,
>>
>> Currently there are no built in processors to parse log formats, but have
>> you taken a look at the ExtractText processor [1]?
>>
>> If you can come up with a regular expression for whatever you are trying
>> to extract, then you should be able to use ExtractText.
>>
>> Other options...
>>
>> You could write a custom processor, but this sounds like it might be
>> overkill for your scenario.
>> In the next release (hopefully out in a few days) there will be two new
>> processors that support scripting languages. It may be easier to use a
>> scripting language to manipulate/parse the text.
>>
>> Thanks,
>>
>> Bryan
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ExtractText/index.html
>>
>>
>> On Fri, Feb 12, 2016 at 12:16 AM, Madhukar Thota <
>> madhukar.th...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I am very new to Apache Nifi and just started learning about how to use
>>> it.
>>>
>>> We have a requirement where we need to parse log4j/logback pattern
>>> messages coming from SyslogAppenders via Syslog udp. I can read the
>>> standard syslog messages, but how can I further extract the log4j/logback
>>> messages from the syslog body?
>>>
>>> Is there any log parsers( log4j/logback/Apache access log format)
>>> available in apache nifi?
>>>
>>>
>>> Any help on this much appreciated.
>>>
>>> Thanks in Advance.
>>>
>>>
>>
>


Re: Add date in the flow file attribute

2016-02-03 Thread Bryan Bende
Just adding to what Juan said...

The PutMongo processor sends the content of a FlowFile to Mongo, so if you
use AttributesToJson -> PutMongo, with AttributesToJson Destination set to
flowfile-content, then you would be sending the attributes to Mongo.
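
For example, if an UpdateAttribute earlier in the flow adds an attribute such as
ingest.time with a value of ${now():format('yyyy-MM-dd HH:mm:ss.SSS')} (the
attribute name here is just an example), then AttributesToJSON with Destination
set to flowfile-content would replace the content with a small flat JSON
document along the lines of {"ingest.time" : "2016-02-03 09:15:42.123", ...},
and PutMongo would insert that document as-is.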

-Bryan

On Wed, Feb 3, 2016 at 9:22 AM, Juan Sequeiros  wrote:

> Sudeep,
>
> You can pass the attributes by expressing them like this: ${key}
> Or AttributeToJson:
>
> "
> *Destination* flowfile-attribute
>
>- flowfile-attribute
>- flowfile-content
>
> Control if JSON value is written as a new flowfile attribute
> 'JSONAttributes' or written in the flowfile content. Writing to flowfile
> content will overwrite any existing flowfile content.
>
> On Tue, Feb 2, 2016 at 10:51 PM, sudeep mishra 
> wrote:
>
>> Hi Joe,
>>
>> Looks like I did not phrase my question correctly. AttributeToJson works
>> fine and the documentation is also detailed.
>>
>> What I am looking for is a way to get the attributes of a flow file and
>> pass only those to other processor.
>>
>> On Tue, Feb 2, 2016 at 10:50 PM, Joe Percivall 
>> wrote:
>>
>>> Glad UpdateAttribute works for you.
>>>
>>> You are seeing AttributeToJson append the information to the content?
>>> That is not what the documentation says or how it should be behaving
>>> (should replace the contents). Could you send more information documenting
>>> this?
>>>
>>> Joe
>>> - - - - - -
>>> Joseph Percivall
>>> linkedin.com/in/Percivall
>>> e: joeperciv...@yahoo.com
>>>
>>>
>>>
>>> On Tuesday, February 2, 2016 12:11 PM, sudeep mishra <
>>> sudeepshekh...@gmail.com> wrote:
>>>
>>>
>>>
>>> Thanks Joe.
>>>
>>> The UpdateAttribute processor can be helpful for my case. Also is it
>>> possible to push only the attributes to  Mongo? I could see an
>>> AttributeToJson object but it seems to be appending the information in flow
>>> file content or attribute. What is a good way to capture only attributes
>>> and send it to MongoDb?
>>>
>>>
>>>
>>> On Tue, Feb 2, 2016 at 8:42 PM, Joe Percivall 
>>> wrote:
>>>
>>> Hello Sudeep,
>>> >
>>> >How precise do you need the date/time to be? What you could do is add
>>> an UpdateAttribute processor[1] after ingesting which uses the Expression
>>> language functions "now" [2] and "format" [3] to add the date/time down to
>>> the millisecond.
>>> >
>>> >There would of course be a bit of error between when it was ingested
>>> and when it is processed by UpdateAttribute but UpdateAttribute is very
>>> fast and there may actually not be any measurable delay.
>>> >
>>> >[1]
>>> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.attributes.UpdateAttribute/index.html
>>> >[2]
>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#now
>>> >[3]
>>> https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html#format
>>> >
>>> >Hope that helps,
>>> >Joe
>>> >- - - - - -
>>> >Joseph Percivall
>>> >linkedin.com/in/Percivall
>>> >e: joeperciv...@yahoo.com
>>> >
>>> >
>>> >
>>> >
>>> >On Tuesday, February 2, 2016 1:17 AM, sudeep mishra <
>>> sudeepshekh...@gmail.com> wrote:
>>> >
>>> >
>>> >
>>> >Hi,
>>> >
>>> >I need to create some audits around the NiFi flows and want to add the
>>> time a flow file was received by a particular processor. Is there a way to
>>> add this date in the attributes for flow files?
>>> >
>>> >I can see a date in the 'Details' section for a data provenance entry
>>> but can we get such a date in the attributes as well?
>>> >
>>> >
>>> >Thanks & Regards,
>>> >
>>> >Sudeep
>>> >
>>>
>>>
>>> --
>>>
>>> Thanks & Regards,
>>>
>>> Sudeep
>>>
>>
>>
>>
>> --
>> Thanks & Regards,
>>
>> Sudeep
>>
>>
>
>
> --
> Juan Carlos Sequeiros
>


Re: GetMail processor

2016-02-22 Thread Bryan Bende
Phil,

A GetEmail processor is planned. There is a JIRA for it, most likely for
the 0.6.0 release:

https://issues.apache.org/jira/browse/NIFI-1148

-Bryan


On Fri, Feb 19, 2016 at 4:48 AM, <philippe.gib...@orange.com> wrote:

> Hi,
>
> I would like to know if a GetEmail processor is available somewhere or
> planned. :) I have seen PutEmail but not the dual processor in the help
>
>
>
> The goal is to automatically and regularly process incoming mail,
> transform the content and index the transformed content with Solr.
>
> phil
>
> Best regards
>
>
>
> *De :* Bryan Bende [mailto:bbe...@gmail.com]
> *Envoyé :* jeudi 4 février 2016 15:08
> *À :* users@nifi.apache.org
> *Objet :* Re: Send parameters while using getmongo processor
>
>
>
> Hi Sudeep,
>
>
>
> From looking at the GetMongo processor, I don't think this can be done
> today. That processor is meant to be a source processor that extracts data
> from Mongo using a fixed query.
>
> It seems to me like we would need a FetchMongo processor with a Query
> field that supported expression language, and you could set it to
> ContractNumber = ${flow.flile.contract.number}
>
> Then incoming FlowFiles would have "flow.file.contract.number" as an
> attribute and it would fetch documents matching that.
>
>
>
> I don't know that much about MongoDB, but does that sound like what you
> need?
>
>
>
> -Bryan
>
>
>
>
>
> On Thu, Feb 4, 2016 at 8:00 AM, sudeep mishra <sudeepshekh...@gmail.com>
> wrote:
>
> Hi,
>
>
>
> I have following schema of records in MongoDB.
>
>
>
> {
>
> "_id" : ObjectId("56b1958a1ebecc0724588c39"),
>
> "ContractNumber" : "ABC87gdtr53",
>
> "DocumentType" : "TestDoc",
>
> "FlowNr" : 3,
>
> "TimeStamp" : "03/02/2016 05:51:09:023"
>
> }
>
>
>
> How can I query for a particular contract by 'ContractNumber' using
> getmongo processor? I want to dynamically pass the value of ContractNumber
> and get back the results.
>
>
>
>
>
> Thanks & Regards,
>
>
>
> Sudeep
>
>
>
> _
>
> Ce message et ses pieces jointes peuvent contenir des informations 
> confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu 
> ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
> electroniques etant susceptibles d'alteration,
> Orange decline toute responsabilite si ce message a ete altere, deforme ou 
> falsifie. Merci.
>
> This message and its attachments may contain confidential or privileged 
> information that may be protected by law;
> they should not be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and delete 
> this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have been 
> modified, changed or falsified.
> Thank you.
>
>


Re: Connecting Spark to Nifi 0.4.0

2016-02-22 Thread Bryan Bende
Hi Kyle,

It seems like the stack trace is suggesting that Spark is trying to
download dependencies, based on the line that references
Executor.updateDependencies:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L391

Any chance you are behind some kind of firewall preventing this?

I'm not that familiar with Spark streaming, but I also noticed in one of
the tutorials that it did something like this:

spark.driver.extraClassPath
/opt/spark-receiver/nifi-spark-receiver-0.4.1.jar:/opt/spark-receiver/nifi-site-to-site-client-0.4.1.jar:/opt/nifi-1.1.1.0-12/lib/nifi-api-1.1.1.0-12.jar:/opt/nifi-1.1.1.0-12/lib/bootstrap/nifi-utils-1.1.1.0-12.jar:/opt/nifi-1.1.1.0-12/work/nar/framework/nifi-framework-nar-1.1.1.0-12.nar-unpacked/META-INF/bundled-dependencies/nifi-client-dto-1.1.1.0-12.jar

Which I would think means it wouldn't have to go out and download the NiFi
dependencies if it is being provided on the class path, but again not
really sure.
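
For reference, the receiver setup in those articles boils down to something like
the following (a rough sketch from memory, so double-check it against the blog
post; the URL and port name are placeholders):

SiteToSiteClientConfig config = new SiteToSiteClient.Builder()
        .url("http://localhost:8080/nifi")   // same host/port the NiFi UI runs on
        .portName("Data for Spark")          // must match the name of the Output Port you pull from
        .buildConfig();

JavaReceiverInputDStream<NiFiDataPacket> packets =
        ssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_ONLY()));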

-Bryan


On Mon, Feb 22, 2016 at 1:09 PM, Kyle Burke 
wrote:

> Joe,
>I’m not sure what to do with Bryan’s comment. The spark code I’m
> running has no problem reading from a Kafka receiver. I only get the error
> when trying to read from a Nifi receiver. When I create a Nifi flow that
> reads from the same kafka stream and sends the data to our output port I
> get the issue.
>
> Respectfully,
>
>
> Kyle Burke | Data Science Engineer
> IgnitionOne - Marketing Technology. Simplified.
> Office: 1545 Peachtree St NE, Suite 500 | Atlanta, GA | 30309
> Direct: 404.961.3918
>
>
>
>
>
>
>
>
>
> On 2/22/16, 1:00 PM, "Joe Witt"  wrote:
>
> >Kyle,
> >
> >Did you get a chance to look into what Bryan mentioned?  He made a
> >great point in that the stacktrace doesn't seem to have any
> >relationship to NiFi or NiFi's site-to-site code.
> >
> >Thanks
> >Joe
> >
> >On Mon, Feb 22, 2016 at 12:58 PM, Kyle Burke 
> wrote:
> >> Telnet leads me to believe the port is open. (I upgrade to 0.5.0 today
> in
> >> hopes that it will help but no luck)
> >>
> >> From Telnet:
> >>
> >> 12:50:11 [~/Dev/nifi/nifi-0.5.0] $ telnet localhost 8080
> >>
> >> Trying ::1...
> >>
> >> Connected to localhost.
> >>
> >> Escape character is '^]’.
> >>
> >>
> >> Respectfully,
> >>
> >> Kyle Burke | Data Science Engineer
> >> IgnitionOne - Marketing Technology. Simplified.
> >> Office: 1545 Peachtree St NE, Suite 500 | Atlanta, GA | 30309
> >> Direct: 404.961.3918
> >>
> >>
> >> From: Joe Witt
> >> Reply-To: "users@nifi.apache.org"
> >> Date: Saturday, February 20, 2016 at 5:16 PM
> >> To: "users@nifi.apache.org"
> >> Subject: Re: Connecting Spark to Nifi 0.4.0
> >>
> >> Kyle
> >>
> >> Can you try connecting to that nifi port using telnet and see if you are
> >> able?
> >>
> >> Use the same host and port as you are in your spark job.
> >>
> >> Thanks
> >> Joe
> >>
> >> On Feb 20, 2016 4:55 PM, "Kyle Burke" 
> wrote:
> >>>
> >>> All,
> >>>I’m attempting to connect Spark to Nifi but I’m getting a “connect
> >>> timed out” error when spark tries to pull records from the input port.
> I
> >>> don’t understand why I”m getting the issue because nifi and spark are
> both
> >>> running on my local laptop. Any suggestions about how to get around the
> >>> issue?
> >>>
> >>> It appears that nifi is listening on the port because I see the
> following
> >>> when running the lsof command:
> >>>
> >>> java31455 kyle.burke 1054u  IPv4 0x1024ddd67a640091  0t0  TCP
> >>> *:9099 (LISTEN)
> >>>
> >>>
> >>> I’ve been following the instructions give in these two articles:
> >>> https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
> >>>
> >>>
> https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html
> >>>
> >>> Here is how I have my nifi.properties setting:
> >>>
> >>> # Site to Site properties
> >>>
> >>> nifi.remote.input.socket.host=
> >>>
> >>> nifi.remote.input.socket.port=9099
> >>>
> >>> nifi.remote.input.secure=false
> >>>
> >>>
> >>> Below is the full error stack:
> >>>
> >>> 16/02/20 16:34:45 ERROR Executor: Exception in task 0.0 in stage 0.0
> (TID
> >>> 0)
> >>>
> >>> java.net.SocketTimeoutException: connect timed out
> >>>
> >>> at java.net.PlainSocketImpl.socketConnect(Native Method)
> >>>
> >>> at
> >>>
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> >>>
> >>> at
> >>>
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> >>>
> >>> at
> >>>
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> >>>
> >>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> >>>
> >>> at java.net.Socket.connect(Socket.java:589)
> >>>
> >>> at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
> >>>
> >>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
> >>>
> >>> at 

Re: Newbie looking for docs subdirectory in NAR

2016-01-20 Thread Bryan Bende
Hi Russ,

If you have the traditional bundle with a jar project for your processors
and a nar project that packages everything, then the additionalDetails.html
goes in the jar project under src/main/resources/docs followed by a
directory with the fully qualified class name of your processor.

As an example, here is the solr bundle:

https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-solr-bundle/nifi-solr-processors/src/main/resources/docs/org.apache.nifi.processors.solr.PutSolrContentStream

It will end up inside your processors jar which in turn will end up inside
your nar, thus making it available on the NiFi classpath at runtime.
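
For example, if your processor class were com.example.processors.MyProcessor (a
made-up name), the file would live in the processors jar project at:

src/main/resources/docs/com.example.processors.MyProcessor/additionalDetails.html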

Let us know if it is still not clear.

Thanks,

Bryan

On Wed, Jan 20, 2016 at 12:50 PM, Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

> I'm trying to add the documented file, *additionalDetails.html*, to the
> proposed *docs* subdirectory under the root of my processor's NAR. It's a
> bit mystifying:
>
>- Where in my (Eclipse/IntelliJ) project do I put it such that the NAR
>plug-in will find it?
>- Where will it end up in the NAR?
>
>
> I'm reading https://nifi.apache.org/developer-guide.html of course.
>
> The existing NiFi sample projects are exceedingly sparse.
>
> Thanks,
>
> Russ
>


Re: splitText output appears to be getting dropped

2016-02-19 Thread Bryan Bende
Hello,

MergeContent has properties for header, demarcator, and footer, and also
has a strategy property which specifies whether these values come from a
file or inline text.

If you do inline text and specify a demarcator of a new line (shift + enter
in the demarcator value) then binary concatenation will get you all of the
lines merged together with new lines between them.

As far as the file naming, can you just wait until after RouteOnContent to
rename them? They just need to be renamed before the PutFile, but it doesn't
necessarily have to be before RouteOnContent.
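
For example, an UpdateAttribute right before PutFile that sets filename to
something like ${filename}-${uuid} would keep the names unique without touching
anything earlier in the flow.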

Let us know if that helps.

Thanks,

Bryan


On Fri, Feb 19, 2016 at 11:01 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Hi,
> Sorry to piggy back on this thread, but I have pretty much the same issue
> – I am splitting log files -> routeoncontent (various paths) two of these
> paths (including unmatched), basically need to just get farmed off into a
> directory just in case they are needed later.
> These go into a MergeContent processor where I would like to merge into
> one file – each flowfile content as a line in the file delimited by line
> feed (as like the original file), whichever way I try this though doesn’t
> quite do what I want. If I try BinaryConcatenation the file ends up as one
> long line, if TAR each Flowfile is a separate file in a TAR (not
> unsurprisingly). There doesn’t seem to be anyway of merging flow file
> content into one file (that ideally has similar functions to be able to
> compress, specify number of files etc.)
>
> Another related question to the answer below (really helped me out with
> same issue), however if I rename the filename early on in my process flow,
> it appears to be changed back to its original at MergeContent processor
> time so I have to put another UpdateAttributes step in after the Merge to
> rename the filename.
> The flow is
>
> UpdateAttributes -> RouteOnContent -> UpdateAttribute -> MergeContent -> PutFile
>        ^                  ^                  ^                ^
>        |                  |                  |                |
>   Filename changed       same               same           reverted
>
> If I put an extra UpdateAttribute before PutFile then fine. Logging at
> each of the above points shows filename updated to ${uuid}-${filename}, but
> at reverted is back at filename.
>
> Any suggestions on particularly the first question??
>
> Thanks
> Conrad
>
>
>
> From: Jeff Lord 
> Reply-To: "users@nifi.apache.org" 
> Date: Friday, 19 February 2016 at 03:22
> To: "users@nifi.apache.org" 
> Subject: Re: splitText output appears to be getting dropped
>
> Matt,
>
> Thanks a bunch!
> That did the trick.
> Is there a better way to handle this out of curiosity? Than writing out a
> single line into multiple files.
> Each file contains a single string that will be used to build a url.
>
> -Jeff
>
> On Thu, Feb 18, 2016 at 6:00 PM, Matthew Clarke  > wrote:
>
>> Jeff,
>>   It appears your files are being dropped because you are
>> auto-terminating the failure relationship on your PutFile processor. When
>> the SplitText processor splits the file by lines, every new file has the
>> same filename as the original it came from. My guess is the first file is
>> being written to disk and all others are failing because a file of the same
>> name already exists in the target dir. Try adding an UpdateAttribute processor
>> after the SplitText to rename all the files. The easiest way is to append the
>> file's uuid to its filename.  I also do not recommend auto-terminating
>> failure relationships except in rare cases.
>>
>> Matt
>> On Feb 18, 2016 8:36 PM, "Jeff Lord"  wrote:
>>
>>> I have a pretty simple flow where I query for a list of ids using
>>> executeProcess and than pass that list along to splitText where I am trying
>>> to split on each line to than dynamically build a url further down the line
>>> using updateAttribute and so on.
>>>
>>> executeProcess -> splitText -> putFile
>>>
>>> For some reason I am only getting one file written with one line.
>>> I would expect something more like 100 files each with one line.
>>> Using the provenance reporter it appears that some of my items are being
>>> dropped.
>>>
>>> Time02/18/2016 17:13:46.145 PST
>>> Event DurationNo value set
>>> Lineage Duration00:00:12.187
>>> TypeDROP
>>> FlowFile Uuid7fa42367-490d-4b54-a32f-d062a885474a
>>> File Size14 bytes
>>> Component Id3b37a828-ba2c-4047-ba7a-578fd0684ce6
>>> Component NamePutFile
>>> Component TypePutFile
>>> DetailsAuto-Terminated by failure Relationship
>>>
>>> Any ideas on what I need to change here?
>>>
>>> Thanks in advance,
>>>
>>> Jeff
>>>
>>
>
>
> ***This email originated outside SecureData***
>
> Click here  to
> report this email as spam.
>
>
> SecureData, combating cyber threats
>
> --
>
> The information contained in this message or any of its attachments may be
> privileged and confidential 

Re: Connecting Spark to Nifi 0.4.0

2016-02-20 Thread Bryan Bende
Just wanted to point out that the stack trace doesn't actually show the
error coming from code in the NiFi Site-To-Site client, so I wonder if it
is something else related to Spark.

Seems similar to this error, but not sure:
https://stackoverflow.com/questions/27013795/failed-to-run-the-spark-example-locally-on-a-macbook-with-error-lost-task-1-0-i
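For what it's worth, Joe's suggestion below can be tried from the machine running the
Spark job with something like:

    telnet localhost 9099

using the same host and port the Spark job is configured to use for site-to-site.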

On Sat, Feb 20, 2016 at 5:16 PM, Joe Witt  wrote:

> Kyle
>
> Can you try connecting to that nifi port using telnet and see if you are
> able?
>
> Use the same host and port as you are in your spark job.
>
> Thanks
> Joe
> On Feb 20, 2016 4:55 PM, "Kyle Burke"  wrote:
>
>> All,
>>I’m attempting to connect Spark to Nifi but I’m getting a “connect
>> timed out” error when spark tries to pull records from the input port. I
>> don’t understand why I”m getting the issue because nifi and spark are both
>> running on my local laptop. Any suggestions about how to get around the
>> issue?
>>
>> *It appears that nifi is listening on the port because I see the
>> following when running the lsof command:*
>>
>> java31455 kyle.burke 1054u  IPv4 0x1024ddd67a640091  0t0  TCP
>> *:9099 (LISTEN)
>>
>>
>> *I’ve been following the instructions give in these two articles:*
>> https://blogs.apache.org/nifi/entry/stream_processing_nifi_and_spark
>>
>> https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html
>>
>> *Here is how I have my nifi.properties setting:*
>>
>> # Site to Site properties
>>
>> nifi.remote.input.socket.host=
>>
>> nifi.remote.input.socket.port=9099
>>
>> nifi.remote.input.secure=false
>>
>>
>> *Below is the full error stack:*
>>
>> 16/02/20 16:34:45 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
>> 0)
>>
>> java.net.SocketTimeoutException: connect timed out
>>
>> at java.net.PlainSocketImpl.socketConnect(Native Method)
>>
>> at
>> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>>
>> at
>> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>>
>> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>>
>> at java.net.Socket.connect(Socket.java:589)
>>
>> at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>>
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>>
>> at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>>
>> at sun.net.www.http.HttpClient.(HttpClient.java:211)
>>
>> at sun.net.www.http.HttpClient.New(HttpClient.java:308)
>>
>> at sun.net.www.http.HttpClient.New(HttpClient.java:326)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1168)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1104)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:998)
>>
>> at
>> sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:932)
>>
>> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:555)
>>
>> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:369)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
>>
>> at
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
>>
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> at
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> at
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>>
>> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>>
>> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>>
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>
>> at org.apache.spark.executor.Executor.org
>> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
>>
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Respectfully,
>>
>> *Kyle Burke *| Data Science Engineer
>> *IgnitionOne - *Marketing Technology. Simplified.
>>
>


Re: Nifi 0.50 and GetKafka Issues

2016-02-21 Thread Bryan Bende
I know this does not address the larger problem, but in this specific case,
would the 0.4.1 Kafka NAR still work in 0.5.x?

If the NAR doesn't depend on any other NARs I would think it would still
work, and could be a work around for those that need to stay on Kafka 0.8.2.
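As a possible (untested) workaround sketch, that would mean taking the Kafka NAR that
shipped with 0.4.1 (a file named something like nifi-kafka-nar-0.4.1.nar in that
release's lib directory), copying it into the lib directory of the 0.5.x install in
place of the newer Kafka NAR, and restarting NiFi.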

On Sunday, February 21, 2016, Oleg Zhurakousky 
wrote:

> The unfortunate part is that between 0.8 and 0.9 there are also breaking
> API changes on the Kafka side that would affect our code, so I say we need
> to probably start thinking more about versioning. And in fact we are, in the
> context of the extension registry, but what I am now suggesting is that
> versioning must come in isolation from the registry.
>
>
> As far as Zookeeper, I simply pointed it out as one of the changes that were
> made, so I am glad it's not affecting it. I guess that leaves the protocol
> incompatibilities.
>
> Cheers
> Oleg
>
> On Feb 21, 2016, at 5:23 PM, Joe Witt  > wrote:
>
> Yeah the intent is to support 0.8 and 0.9.  Will figure something out.
>
> Thanks
> Joe
> On Feb 21, 2016 4:47 PM, "West, Joshua"  > wrote:
>
>> Hi Oleg,
>>
>> Hmm -- from what I can tell, this isn't a Zookeeper communication issue.
>> Nifi is able to connect into the Kafka brokers' Zookeeper cluster and
>> retrieve the list of the kafka brokers to connect to.  Seems, from the
>> logs, to be a problem when attempting to consume from Kafka itself.
>>
>> I'm guessing that the Kafka 0.9.0 client libraries just are not
>> compatible with Kafka 0.8.2.1 so in order to use Nifi 0.5.0 with Kafka, the
>> Kafka version must be >= 0.9.0.
>>
>> Any chance NiFi could add backwards-compatible support for Kafka 0.8.2.1
>> too?  Letting you choose which client library version to use when setting up the
>> GetKafka processor?
>>
>> --
>> Josh West > >
>> Bose Corporation
>>
>>
>>
>> On Sun, 2016-02-21 at 15:02 +, Oleg Zhurakousky wrote:
>>
>> Josh
>>
>> Also, keep in mind that there are incompatible property names in Kafka
>> between the 0.7 and 0.8 releases. One of the changes that went in was
>> replacing “zk.connectiontimeout.ms” with “zookeeper.connection.timeout.ms”.
>> Not sure if it’s related though, but realizing that 0.4.1 was relying on
>> this property it’s value was completely ignored with 0.8 client libraries
>> (you could actually see the WARN message to that effect) and now it is not
>> ignored, so take a look and see if tinkering with its value changes
>> something.
>>
>> Cheers
>> Oleg
>>
>> On Feb 20, 2016, at 6:47 PM, Oleg Zhurakousky <
>> ozhurakou...@hortonworks.com
>> > wrote:
>>
>> Josh
>>
>> The only change that went in and is relevant to your issue is the fact that
>> we’ve upgraded client libraries to Kafka 0.9 and between 0.8 and 0.9 Kafka
>> introduced wire protocol changes that break compatibility.
>> I am still digging so stay tuned.
>>
>> Oleg
>>
>> On Feb 20, 2016, at 4:10 PM, West, Joshua > > wrote:
>>
>> Hi Oleg and Joe,
>>
>> Kafka 0.8.2.1
>>
>> Attached is the app log with hostnames scrubbed.
>>
>> Thanks for your help.  Much appreciated.
>>
>> --
>> Josh West > >
>> Bose Corporation
>>
>>
>>
>> On Sat, 2016-02-20 at 15:46 -0500, Joe Witt wrote:
>>
>> And also what version of Kafka are you using?
>> On Feb 20, 2016 3:37 PM, "Oleg Zhurakousky" > > wrote:
>>
>> Josh
>>
>> Any chance to attache the app-log or relevant stack trace?
>>
>> Thanks
>> Oleg
>>
>> On Feb 20, 2016, at 3:30 PM, West, Joshua > > wrote:
>>
>> Hi folks,
>>
>> I've upgraded from Nifi 0.4.1 to 0.5.0 and I am no longer able to use the
>> GetKafka processor.  I'm seeing errors like so:
>>
>> 2016-02-20 20:10:14,953 WARN
>> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0]
>> kafka.consumer.ConsumerFetcherThread
>> [ConsumerFetcherThread-NiFi-sldjflkdsjflksjf_**SCRUBBED**-1455999008728-5b8c7108-0-0],
>> Error in fetchkafka.consumer.ConsumerFetcherThread$FetchRequest@7b49a642
>> .
>> Possible cause: java.lang.IllegalArgumentException
>>
>> ^ Note  the hostname of the server has been scrubbed.
>>
>> My configuration is pretty generic, except that with Zookeeper we use a
>> different root path, so our Zookeeper connect string looks like so:
>>
>> zookeeper-node1:2181,zookeeper-node2:2181,zookeeper-node3:2181/kafka
>>
>> Is anybody else experiencing issues?
>>
>> Thanks.
>>
>> --
>> 

Re: Nifi 'as a service'?

2016-02-17 Thread Bryan Bende
Keaton,

You can definitely build a REST service in NiFi! I would take a look at
HandleHttpRequest and HandleHttpResponse.

HandleHttpRequest would be the entry point of your service, the FlowFiles
coming out of this processor would represent the request being made, you
can then perform whatever logic you need and send a response back with
HandleHttpResponse.
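A minimal sketch of that pattern (processor and service names as they exist in NiFi,
the routing in between is up to you):

    HandleHttpRequest -> (your routing / processing) -> HandleHttpResponse

Both processors are configured with the same HTTP Context Map controller service
(StandardHttpContextMap), which is what correlates each response back to its original
request.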

Let us know if that doesn't make sense.

Thanks,

Bryan

[1]
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.HandleHttpRequest/index.html
[2]
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.HandleHttpResponse/index.html

On Wed, Feb 17, 2016 at 12:58 PM, Keaton Cleve 
wrote:

> Hi,
>
> Would it be possible to use Nifi 'as a service'? If yes what would be the
> best pattern?
>
> Here is what I have in mind:
>
> I would like to setup a template with different possible predetermined
> destinations. But instead of having predefined sources that I would query
> with a cron like GetFile or GetHDFS, I would like to have a REST API as an
> entry point and users can request to copy a file from any directory into
> one or several of the predetermined destination (this would involve some
> routing to the correct processor I guess). The REST API would only support
> to specify sources which the templates allows, but the specific directory /
> file would be dynamic.
>
> Does that make any sense?


Re: Large dataset on hbase

2016-04-12 Thread Bryan Bende
Hi Prabhu,

How did you end up converting your CSV into JSON?

PutHBaseJSON creates a single row from a JSON document. In your example
above, using n1 as the rowId, it would create a row with columns n2 - n22.
Are you seeing columns missing, or are you missing whole rows from your
original CSV?
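As a small illustration of that behavior (values made up), a FlowFile containing

    {"n1":"row-1","n2":"a","n3":"b"}

with Row Identifier Field Name = n1 and Column Family = Sweet would become one HBase
row with row key row-1 and cells Sweet:n2 = a and Sweet:n3 = b.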

Thanks,

Bryan



On Mon, Apr 11, 2016 at 11:43 AM, prabhu Mahendran 
wrote:

> Hi Simon/Joe,
>
> Thanks for this support.
> I have successfully converted the CSV data into JSON and also insert those
> JSON data into Hbase Table using PutHBaseJSon.
> Part of JSON Sample Data like below:
>
> {
> "n1":"",
> "n2":"",
> "n3":"",
> "n4":"","n5":"","n6":"",
> "n7":"",
> "n8":"",
> "n9":"",
>
> "n10":"","n11":"","n12":"","n13":"","n14":"","n15":"","n16":"",
>
> "n17":"","n18":"","n19":"","n20":"","n21":"-",
> "n22":""
>
> }
> PutHBaseJSON:
>Table Name is 'Hike' , Column Family:'Sweet' ,Row
> Identifier Field Name:n1(Element in JSON File).
>
> My record contains 15 lakh (1.5 million) rows, but the HBase table contains only 10 rows.
> It can read all 15 lakh rows but stores only a small number of them.
>
> Anyone please help me to solve this?
>
>
>
>
> Prabhu,
>
> If the dataset being processed can be split up and still retain the
> necessary meaning when input to HBase I'd recommend doing that.  NiFI
> itself, as a framework, can handle very large objects because its API
> doesn't force loading of entire objects into memory.  However, various
> processors may do that and I believe ReplaceText may be one that does.
> You can use SplitText or ExecuteScript or other processors to do that
> splitting if that will help your case.
>
> Thanks
> Joe
>
> On Sat, Apr 9, 2016 at 6:35 PM, Simon Ball  wrote:
> > Hi Prabhu,
> >
> > Did you try increasing the heap size in conf/bootstrap.conf? By default
> nifi
> > uses a very small RAM allocation (512MB). You can increase this by
> tweaking
> > java.arg.2 and .3 in the bootstrap.conf file. Note that this is the java
> > heap, so you will need more than your data size to account for java
> object
> > overhead. The other thing to check is the buffer sizes you are using for
> > your replace text processors. If you’re also using Split processors, you
> can
> > sometime run up against RAM and open file limits, if this is the case,
> make
> > sure you increase the ulimit -n settings.
> >
> > Simon
> >
> > On 9 Apr 2016, at 16:51, prabhu Mahendran 
> wrote:
> >
> > Hi,
> >
> > I am new to nifi and does not know how to process large data like one gb
> csv
> > data into hbase.while try combination of getFile and putHbase shell leads
> > Java Out of memory error and also try combination of replace text,
> extract
> > text and puthbasejson doesn't work on large dataset but it work
> correctly in
> > smaller dataset.
> > Can anyone please help me to solve this?
> > Thanks in advance.
> >
> > Thanks & Regards,
> > Prabhu Mahendran
> >
> >
>


Re: Large dataset on hbase

2016-04-12 Thread Bryan Bende
Is the output of your Pig script a single file that contains all the JSON
documents corresponding to your CSV?
or does it create a single JSON document for each row of the CSV?
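The reason I ask: PutHBaseJSON expects each FlowFile to contain a single JSON document.
If the Pig output is one big file with one JSON object per line, a sketch of a fix
(assuming that layout) would be to insert a SplitText processor with Line Split Count = 1
ahead of PutHBaseJSON, so that each FlowFile carries exactly one document.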

Also, are there any errors in logs/nifi-app.log (or on the processor in the
UI) when this happens?

-Bryan

On Tue, Apr 12, 2016 at 12:38 PM, prabhu Mahendran <prabhuu161...@gmail.com>
wrote:

> Hi,
>
> I just use Pig Script to convert the CSV into JSON with help of
> ExecuteProcess.
>
> In my case i have use n1 from JSON document which could be stored as row
> key in HBase Table.So n2-n22 store as columns in hbase.
>
> some of rows (n1's) are stored inside the table but remaining are read
> well but not stored.
>
> Thanks,
> Prabhu Mahendran
>
> On Tue, Apr 12, 2016 at 1:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Hi Prabhu,
>>
>> How did you end up converting your CSV into JSON?
>>
>> PutHBaseJSON creates a single row from a JSON document. In your example
>> above, using n1 as the rowId, it would create a row with columns n2 - n22.
>> Are you seeing columns missing, or are you missing whole rows from your
>> original CSV?
>>
>> Thanks,
>>
>> Bryan
>>
>>
>>
>> On Mon, Apr 11, 2016 at 11:43 AM, prabhu Mahendran <
>> prabhuu161...@gmail.com> wrote:
>>
>>> Hi Simon/Joe,
>>>
>>> Thanks for this support.
>>> I have successfully converted the CSV data into JSON and also insert
>>> those JSON data into Hbase Table using PutHBaseJSon.
>>> Part of JSON Sample Data like below:
>>>
>>> {
>>> "n1":"",
>>> "n2":"",
>>> "n3":"",
>>> "n4":"","n5":"","n6":"",
>>> "n7":"",
>>> "n8":"",
>>> "n9":"",
>>>
>>> "n10":"","n11":"","n12":"","n13":"","n14":"","n15":"","n16":"",
>>>
>>> "n17":"","n18":"","n19":"","n20":"","n21":"-",
>>> "n22":""
>>>
>>> }
>>> PutHBaseJSON:
>>>Table Name is 'Hike' , Column Family:'Sweet' ,Row
>>> Identifier Field Name:n1(Element in JSON File).
>>>
>>> My record contains 15 lakh (1.5 million) rows, but the HBase table contains only 10 rows.
>>> It can read all 15 lakh rows but stores only a small number of them.
>>>
>>> Anyone please help me to solve this?
>>>
>>>
>>>
>>>
>>> Prabhu,
>>>
>>> If the dataset being processed can be split up and still retain the
>>> necessary meaning when input to HBase I'd recommend doing that.  NiFI
>>> itself, as a framework, can handle very large objects because its API
>>> doesn't force loading of entire objects into memory.  However, various
>>> processors may do that and I believe ReplaceText may be one that does.
>>> You can use SplitText or ExecuteScript or other processors to do that
>>> splitting if that will help your case.
>>>
>>> Thanks
>>> Joe
>>>
>>> On Sat, Apr 9, 2016 at 6:35 PM, Simon Ball <sb...@hortonworks.com>
>>> wrote:
>>> > Hi Prabhu,
>>> >
>>> > Did you try increasing the heap size in conf/bootstrap.conf? By
>>> default nifi
>>> > uses a very small RAM allocation (512MB). You can increase this by
>>> tweaking
>>> > java.arg.2 and .3 in the bootstrap.conf file. Note that this is the
>>> java
>>> > heap, so you will need more than your data size to account for java
>>> object
>>> > overhead. The other thing to check is the buffer sizes you are using
>>> for
>>> > your replace text processors. If you’re also using Split processors,
>>> you can
>>> > sometime run up against RAM and open file limits, if this is the case,
>>> make
>>> > sure you increase the ulimit -n settings.
>>> >
>>> > Simon
>>> >
>>> > On 9 Apr 2016, at 16:51, prabhu Mahendran <prabhuu161...@gmail.com>
>>> wrote:
>>> >
>>> > Hi,
>>> >
>>> > I am new to nifi and does not know how to process large data like one
>>> gb csv
>>> > data into hbase.while try combination of getFile and putHbase shell
>>> leads
>>> > Java Out of memory error and also try combination of replace text,
>>> extract
>>> > text and puthbasejson doesn't work on large dataset but it work
>>> correctly in
>>> > smaller dataset.
>>> > Can anyone please help me to solve this?
>>> > Thanks in advance.
>>> >
>>> > Thanks & Regards,
>>> > Prabhu Mahendran
>>> >
>>> >
>>>
>>
>>
>


Re: ListFile - FetchFile Cluster behavior

2016-03-24 Thread Bryan Bende
Hi Lee,

The List+Fetch model in a cluster is one of the trickier configurations to
set up.

This article has a good description with a diagram under the "pulling
section" that shows ListHDFS+FetchHDFS, but should be the same for
ListFile+FetchFile:

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

The short answer is you would connect ListFile to a Remote Process Group
that points back to the same cluster, and then an Input Port goes to Fetch
File, and it is the Remote Process Group that distributes the data across
the cluster.
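As a rough sketch (the input port name is just an example):

    ListFile (scheduled On Primary Node) -> Remote Process Group (pointing at this same cluster)
    Input Port "files-to-fetch" (root group) -> FetchFile -> rest of the flow

The Remote Process Group is what spreads the listed FlowFiles across the nodes, so
FetchFile then runs on every node.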

Hopefully this helps.

-Bryan


On Thu, Mar 24, 2016 at 4:55 PM, Lee Laim  wrote:

> I'm using the ListFile/FetchFile combination in cluster mode.
>
> When *ListFile is set to run on primary node* and *Fetch File is set
> to default*, The generated flow files only run on  the primary node,
> other nodes sit out.
>
> When *ListFile  and FetchFile is set to run on default* (timer driven),
> They generate flow files which are then consumed by all downstream nodes.
>
> Is this expected behavior? Or is something off with my deployment?
>
> What I am seeing appears to be contrary to the usage description; ListFile
> (primary) generates one list of flow files to organize and distribute work
> to the rest of the cluster.
>
> I'm running 0.5.1 on 3 nodes.
>
> Thanks,
> Lee
>


Re: String conversion to Int, float double

2016-03-24 Thread Bryan Bende
I think the problem is that all attributes are actually Strings internally;
even after calling toNumber(), the conversion is only temporary while the expression
language is executing.

So by the time it gets to AttributesToJson it doesn't have any information
about the type of each attribute and they all end up as Strings. I think we
would have to come up with a way to pass some type information along to
AttributesToJson in order to get something other than Strings.
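For example (using the value from this thread), with an UpdateAttribute property of

    t_resp = ${http.param.t_resp:toNumber()}

the attribute value is still the string "132", so AttributesToJSON emits
{"t_resp":"132"} rather than {"t_resp":132}.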

-Bryan


On Thu, Mar 24, 2016 at 3:30 PM, Madhukar Thota 
wrote:

> Hi i am trying to convert string value to integer in UpdateAtrributes
> using toNumber like this
>
>
> ${http.param.t_resp:toNumber()}  where http.param.t_resp = "132"
>
> but when the fileattribute pushed to Attributetojson processor , i am
> stilling seeing it as string. Am i am doing something wrong? and also how
> can i convert string to float?
>
>
>
>
>


Re: ExecuteSQL and NiFi 0.5.1 - Error org.apache.avro.SchemaParseException: Empty name

2016-03-05 Thread Bryan Bende
I think this is a legitimate bug that was introduced in 0.5.0.

I created this ticket: https://issues.apache.org/jira/browse/NIFI-1596

For those interested, I think the line of code causing the problem is this:

https://github.com/apache/nifi/blob/0e926074661302c65c74ddee3af183ff49642da7/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/main/java/org/apache/nifi/processors/standard/util/JdbcCommon.java#L133

I think that should be:
  if (!StringUtils.isBlank(tableNameFromMeta))

Haven't tried this, but based on the error that was reported it seems like
this could be the problem.
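A rough sketch of the guard being suggested (variable names here are assumptions, not
the exact JdbcCommon source):

    // only use the table name when the driver actually returns a non-blank one
    final String tableNameFromMeta = meta.getTableName(i);
    if (!StringUtils.isBlank(tableNameFromMeta)) {
        tableName = tableNameFromMeta;
    }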

-Bryan

On Sat, Mar 5, 2016 at 4:52 PM, Marcelo Valle Ávila 
wrote:

> Hello Juan,
>
> Thanks for the response,
>
> I deploy a NiFi 0.5.1 clean installation, and the behavior is still there.
> Reading other user mail of the mailing list, it seems that there is some
> incompatibility between NiFi 0.5.x and Oracle databases (maybe more).
>
> With DB2 databases works fine.
>
> Regards
>
> 2016-03-04 19:27 GMT+01:00 Juan Sequeiros :
>
>> I wonder if on the controller service DBCPConnectionPool associated to
>> your ExecuteSQL processor you have something that can't be found since it's
>> stored on your older release.
>>
>>
>> On Fri, Mar 4, 2016 at 11:12 AM, Marcelo Valle Ávila 
>> wrote:
>>
>>> Hello community,
>>>
>>> I'm starting my first steps with NiFi, and enjoining how it works!
>>>
>>> I started with version 0.4.1 and a simple flow:
>>>
>>> ExecuteSQL -> ConvertAvroToJSON -> PutEventHub
>>>
>>> Reading from an Oracle database, and everything works like a charm!
>>>
>>> Few days ago NiFi 0.5.1 has been released, and I tried a rolling
>>> upgrade, using my old NiFi flow. The update goes right and my flow is
>>> loaded correctly.
>>>
>>> The problem is when I starts the ExecuteSQL processor, it doesn't
>>> works... In log file I can see this trace:
>>>
>>> ERROR [Timer-Driven Process Thread-8]
>>> o.a.nifi.processors.standard.ExecuteSQL
>>> org.apache.avro.SchemaParseException: Empty name
>>> at org.apache.avro.Schema.validateName(Schema.java:1076) ~[na:na]
>>> at org.apache.avro.Schema.access$200(Schema.java:79) ~[na:na]
>>> at org.apache.avro.Schema$Name.(Schema.java:436) ~[na:na]
>>> at org.apache.avro.Schema.createRecord(Schema.java:145) ~[na:na]
>>> at
>>> org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1732)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.processors.standard.util.JdbcCommon.createSchema(JdbcCommon.java:138)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.processors.standard.util.JdbcCommon.convertToAvroStream(JdbcCommon.java:72)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.processors.standard.ExecuteSQL$1.process(ExecuteSQL.java:158)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.controller.repository.StandardProcessSession.write(StandardProcessSession.java:1953)
>>> ~[nifi-framework-core-0.5.1.jar:0.5.1]
>>> at
>>> org.apache.nifi.processors.standard.ExecuteSQL.onTrigger(ExecuteSQL.java:152)
>>> ~[na:na]
>>> at
>>> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
>>> ~[nifi-api-0.5.1.jar:0.5.1]
>>> at
>>> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1139)
>>> [nifi-framework-core-0.5.1.jar:0.5.1]
>>> at
>>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:139)
>>> [nifi-framework-core-0.5.1.jar:0.5.1]
>>> at
>>> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:49)
>>> [nifi-framework-core-0.5.1.jar:0.5.1]
>>> at
>>> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:124)
>>> [nifi-framework-core-0.5.1.jar:0.5.1]
>>> at
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>>> [na:1.7.0_79]
>>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
>>> [na:1.7.0_79]
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
>>> [na:1.7.0_79]
>>> at
>>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>> [na:1.7.0_79]
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> [na:1.7.0_79]
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> [na:1.7.0_79]
>>> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
>>>
>>> I tried with a clean installation of NiFi 0.5.1 and new clean flow, but
>>> the error stills appears, and the processor doesn't starts.
>>>
>>> With a downgrade to NiFi 0.4.1 the processor works perfectly.
>>>
>>> Do you have any idea of what can be failing?
>>> Do you think I'm doing something wrong?
>>>
>>> Thanks in advance!
>>> Marcelo
>>>
>>
>>
>>
>> --
>> Juan Carlos Sequeiros
>>
>
>


Re: PutMongo Processor

2016-03-05 Thread Bryan Bende
Uwe,

Personally I don't have that much experience with MongoDB, but the
additional functionality you described sounds like something we would want
to support.  Looking through JIRA I only see one ticket related to MongoDB
to add SSL support [1] so I think it would be great to create a new JIRA to
capture these ideas and any specifics about how you envision it working.
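For anyone following along, the kind of partial update being asked about looks roughly
like applying an update document such as

    { "$set": { "status": "active" } }

against a query for the target document (field and value here are made up). Today the
FlowFile content is treated as a full replacement document, so operator keys like $set,
$push, or $inc are rejected with the "Invalid BSON field name" error Uwe mentioned.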

Thanks,

Bryan

[1] https://issues.apache.org/jira/browse/NIFI-1197

On Fri, Mar 4, 2016 at 6:43 PM, Uwe Geercken  wrote:

> Hello,
>
> I have tried to use the PutMongo processor. I read Json files from a
> folder and send them to a mongo database.
>
> The insert of documents works seamlessly.
>
> Next I tested updates. The problem here is that for the update a complete
> document is required. So if you insert a document with 5 key/value pairs
> and then make an update using a Json with only 3 key/value pairs then 2
> fields are gone after the update. This is a known behavior in mongodb. But
> mongodb also supports doing updates using the $set operator. In this case
> the key/value pairs provided get updated and the others remain untouched.
> But the nifi processor does not seem to support this technique.
>
> I then went on to test upserts. Same problem here as upserts are basically
> the same as updates. But I also found out that the processor does not
> support some of the other more enhanced operators such as $push (push a
> key/value into an array) or the $inc operator (increment).
>
> The error message from the nifi log says:
> java.lang.IllegalArgumentException: Invalid BSON field name $push - for
> example.
>
> So I think there is currently only limited support for mongodb. But I hope
> this will be enhanced in the near future taking the importance of mongodb.
>
> What is your experiece with this processor? Or maybe you have a workaround?
>
> Rgds,
>
> Uwe
>
>
>


Re: Advance usage documentation not displayed

2016-03-05 Thread Bryan Bende
Russell,

Just want to confirm what you are seeing... so when you bring up the usage
for your processor, you see the normal documentation, but you don't see an
 "Additional Details..." link at the top of the page?

One example I know of is the PutSolrContentStream processor:

https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.solr.PutSolrContentStream/index.html
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.solr.PutSolrContentStream/additionalDetails.html

-Bryan

On Fri, Mar 4, 2016 at 4:39 PM, Russell Bateman <
russell.bate...@perfectsearchcorp.com> wrote:

> Just getting back to this...
>
> I have so far been unable to get the Advanced "Usage" documentation
> feature to work in any of my processors. Whether I right-click on the
> processor in the workspace and choose Usage or click Help in the workspace,
> I get nothing that resembles or contains what I've got in
> *additionalDetails.html*.
>
> The most important bits of my processor are illustrated below. I think
> they match the documentation in
>
>
> https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#advanced-documentation
>
>
> .
> ├── pom.xml
> └── *src*
> └── *main*
> ├── *java*
> │   └── *com*
> │   └── *imatsolutions*
> │   └── *nifi*
> │└── *processor*
> │└── *AppointmentsProcessor*.java
> └── *resources*
> ├── *docs*
> │   └──
> *com.imatsolutions.nifi.processor.AppointmentsProcessor*
> │   └── additionalDetails.html
> └── META-INF
>  └── services
>  └── org.apache.nifi.processor.Processor
>
>
> The nifi-nar-maven-plugin appears to put this where it belongs:
>
>
>


Re: No controller service types found that are applicable for this property

2016-03-31 Thread Bryan Bende
I suspect this is a dependency problem with the way the NAR was built.

How did you create the new project structure for your copied PutHBaseJSON?
You would need the same dependencies that nifi-hbase-bundle has...

A provided dependency on the hbase client service API in your processors
pom:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hbase-bundle/nifi-hbase-processors/pom.xml#L29

A NAR dependency on the nifi-standard-services-api-nar in your NAR's pom:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-hbase-bundle/nifi-hbase-nar/pom.xml#L32
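Roughly, that means something like the following in the two poms (artifact ids shown as
they exist in the NiFi codebase; versions omitted):

    <!-- processors module pom -->
    <dependency>
        <groupId>org.apache.nifi</groupId>
        <artifactId>nifi-hbase-client-service-api</artifactId>
        <scope>provided</scope>
    </dependency>

    <!-- nar module pom -->
    <dependency>
        <groupId>org.apache.nifi</groupId>
        <artifactId>nifi-standard-services-api-nar</artifactId>
        <type>nar</type>
    </dependency>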

Let us know if this helps.

-Bryan


On Thu, Mar 31, 2016 at 9:08 AM, Matt Burgess  wrote:

> Rajeshkumar,
>
> Since this is likely related to the custom processor code (either as a
> result of redeclaring/overriding the existing controller service or NAR
> dependencies or something else), you may find the d...@nifi.apache.org
> mailing list has an audience better suited to help you. In either case, can
> you supply more information, such as a stack trace (if there is one), or
> any information you have learned while debugging (such as the line of code
> that throws the error)?
>
> Thanks,
> Matt
>
> On Thu, Mar 31, 2016 at 8:02 AM, Rajeshkumar J <
> rajeshkumarit8...@gmail.com> wrote:
>
>> Hi
>>
>>   I am new to Apache Nifi and i try to debug Nifi processors using remote
>> debugger. So I have created a new custom processor by copying the source
>> code of putHBaseJSON processor and build it using maven and then copied mar
>> file to lib folder and restarted it. Then I try to configure the property
>> for *HBase Client Service* so I choose *create new service* option then
>> it throws the following error
>>
>> *No controller service types found that are applicable for this property*
>>
>> Please correct me if I am wrong and help me resolving this
>>
>> thanks
>>
>
>


Re: Create row keys for HBase from Json messages

2016-03-21 Thread Bryan Bende
Hong,

Glad to hear you are getting started with NiFi! What do your property names
look like on EvaluateJsonPath?

Typically if you wanted to extract the effective timestamp, event id, and
applicant id from your example json, then you would add properties to
EvaluateJsonPath like the following:

effectiveTimestamp =  $.effectiveTimestamp
eventId = $.event.id
eventApplicantId = $.event.applicant.id

Then in PutHBaseJson if you want the Row Id to be the event id, followed by
applicant id, followed by timestamp, you could do:

${eventId}_${eventApplicantId}_${effectiveTimestamp}

The above expression with your sample JSON should give you:

1e9b91398160471f8b6197ad974e2464_1f4a3862fab54e058305e3c73cc13dd3_2015-12-03T23:17:29.874Z

Now if you wanted to timestamp to be the long representation instead of the
date string, you could do:

${eventId}_${eventApplicantId}_${effectiveTimestamp:toDate("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"):toNumber()}

Let us know if this helps.

-Bryan


On Mon, Mar 21, 2016 at 7:54 PM, Hong Li 
wrote:

> I'm a new user for Nifi, and just started my first Nifi project, where we
> need to move Json messages into HBase.  After I read the templates and user
> guide, I see I still need help to learn how to concatenate the values
> pulled out from the Json messages to form a unique row key for HBase tables.
>
> Given the sample message below, I run into errors where I need to create
> the unique keys for HBase by concatenating values pulled from the messages.
>
> {
> "effectiveTimestamp": "2015-12-03T23:17:29.874Z",
> "event": {
> "@class": "events.policy.PolicyCreated",
> "id": "1e9b91398160471f8b6197ad974e2464",
> "ipAddress": "10.8.30.145",
> "policy": {
> "additionalListedInsureds": [],
> "address": {
> "city": "Skokie",
> "county": "Cook",
> "id": "b863190a5bf846858eb372fb5f532fe7",
> "latitude": 42.0101,
> "longitude": -87.75354,
> "state": "IL",
> "street": "5014 Estes Ave",
> "zip": "60077-3520"
> },
> "applicant": {
> "age": 36,
> "birthDate": "1979-01-12",
> "clientId": "191",
> "creditReport": {
> "id": "ca5ec932d33d444b880c9a43a6eb7c50",
> "reasons": [],
> "referenceNumber": "15317191300474",
> "status": "NoHit"
> },
> "firstName": "Kathy",
> "gender": "Female",
> "id": "1f4a3862fab54e058305e3c73cc13dd3",
> "lastName": "Bockett",
> "maritalStatus": "Single",
> "middleName": "Sue",
> "ssn": "***"
> },
> "channelOfOrigin": "PublicWebsite",
> ... ...
>
> For example, in processor EvaluateJsonPath, I could pull out individual
> values as shown below:
>
> $.effectiveTimestamp
> $.event.id
> $.event.applicant.id
>
>
> However, when I tried to create the HBase row key there such as
>
> ${allAttributes($.event.id, 
> $event.applicant.id):join($.effectiveTimestamp:toDate('MMM
> d HH:mm:ss'):toString()}_${uuid}
>
>
> I could not make it work no matter how I modified or simplified the long
> string.  I must have misunderstood something here.  I don't know if this
> question has already been asked and answered.
>
> Thank you for your help.
> Hong
>
>
> *Hong Li*
>
> *Centric Consulting*
>
> *In Balance*
> (888) 781-7567 office
> (614) 296-7644 mobile
> www.centricconsulting.com | @Centric 
>


Re: Apache NiFi/Hive - store merged tweets in HDFS, create table in hive

2016-04-21 Thread Bryan Bende
Hello,

I believe this example shows an approach to do it (it includes Hive even
though the title is Solr/banana):
https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html

The short version is that it extracts several attributes from each tweet
using EvaluateJsonPath, then uses ReplaceText to replace the FlowFile
content with a pipe delimited string of those attributes, and then creates
a Hive table that knows how to handle that delimiter. With this approach
you don't need to set the header, footer, and demarcator in MergeContent.

create table if not exists tweets_text_partition(
tweet_id bigint,
created_unixtime bigint,
created_time string,
displayname string,
msg string,
fulltext string
)
row format delimited fields terminated by "|"
location "/tmp/tweets_staging";

-Bryan


On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov 
wrote:

> Hi guys,
>
> I want to create a following workflow:
>
> 1.Fetch tweets using GetTwitter processor.
> 2.Merge tweets in a bigger file using MergeContent process.
> 3.Store merged files in HDFS.
> 4. On the hadoop/hive side I want to create an external table based on
> these tweets.
>
> There are examples of how to do this, but what I am missing is how to
> configure the MergeContent processor: what to set as header, footer and
> demarcator. And what to use on the Hive side as a separator so that it will
> split merged tweets into rows. Hope I described myself clearly.
>
> Thanks in advance.
>


Re: Apache NiFi/Hive - store merged tweets in HDFS, create table in hive

2016-04-21 Thread Bryan Bende
Also, this blog has a picture of what I described with MergeContent:

https://blogs.apache.org/nifi/entry/indexing_tweets_with_nifi_and

-Bryan

On Thu, Apr 21, 2016 at 4:37 PM, Bryan Bende <bbe...@gmail.com> wrote:

> Hi Igor,
>
> I don't know that much about Hive so I can't really say what format it
> needs to be in for Hive to understand it.
>
> If it needs to be a valid array of JSON documents, in MergeContent change
> the Delimiter Strategy to "Text" which means it will use whatever values
> you type directly into Header, Footer, and Demarcator, and then specify
> "[" for the Header, "]" for the Footer, and "," for the Demarcator.
>
> That will get you something like this where {...} are the incoming
> documents:
>
> [
> {...},
> {...},
> ]
>
> -Bryan
>
>
> On Thu, Apr 21, 2016 at 4:06 PM, Igor Kravzov <igork.ine...@gmail.com>
> wrote:
>
>> Hi Brian,
>>
>> I am aware of this example. But I want to store JSON as it is and create
>> external table. Like in this example.
>> http://hortonworks.com/blog/howto-use-hive-to-sqlize-your-own-tweets-part-two-loading-hive-sql-queries/
>> What I don't know is how to properly merge multiple JSON in one file in
>> order for hive to read it properly.
>>
>> On Thu, Apr 21, 2016 at 2:33 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I believe this example shows an approach to do it (it includes Hive even
>>> though the title is Solr/banana):
>>>
>>> https://community.hortonworks.com/articles/1282/sample-hdfnifi-flow-to-push-tweets-into-solrbanana.html
>>>
>>> The short version is that it extracts several attributes from each tweet
>>> using EvaluateJsonPath, then uses ReplaceText to replace the FlowFile
>>> content with a pipe delimited string of those attributes, and then creates
>>> a Hive table that knows how to handle that delimiter. With this approach
>>> you don't need to set the header, footer, and demarcator in MergeContent.
>>>
>>> create table if not exists tweets_text_partition(
>>> tweet_id bigint,
>>> created_unixtime bigint,
>>> created_time string,
>>> displayname string,
>>> msg string,
>>> fulltext string
>>> )
>>> row format delimited fields terminated by "|"
>>> location "/tmp/tweets_staging";
>>>
>>> -Bryan
>>>
>>>
>>> On Thu, Apr 21, 2016 at 1:52 PM, Igor Kravzov <igork.ine...@gmail.com>
>>> wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I want to create a following workflow:
>>>>
>>>> 1.Fetch tweets using GetTwitter processor.
>>>> 2.Merge tweets in a bigger file using MergeContent process.
>>>> 3.Store merged files in HDFS.
>>>> 4. On the hadoop/hive side I want to create an external table based on
>>>> these tweets.
>>>>
>>>> There are examples of how to do this, but what I am missing is how to
>>>> configure the MergeContent processor: what to set as header, footer and
>>>> demarcator. And what to use on the Hive side as a separator so that it will
>>>> split merged tweets into rows. Hope I described myself clearly.
>>>>
>>>> Thanks in advance.
>>>>
>>>
>>>
>>
>


Re: Question on setting up nifi flow

2016-04-28 Thread Bryan Bende
Hi Susheel,

In addition to what Pierre mentioned, if you are interested in an example
of using HandleHttpRequest/Response, there is a template in this repository:

https://github.com/hortonworks-gallery/nifi-templates

The template is HttpExecuteLsCommand.xml and shows how to build a web
service in NiFi that performs a directory listing.

-Bryan


On Thu, Apr 28, 2016 at 11:19 AM, Pierre Villard <
pierre.villard...@gmail.com> wrote:

> Hi Susheel,
>
> 1. HandleHttpRequest
> 2. RouteOnAttribute + HandleHttpResponse in case of errors detected in
> headers
> 3. Depending on what you want, there are a lot of options to handle JSON
> data (EvaluateJsonPath will probably be useful)
> 4. GetMongo (I think it will route on success in case there is an entry,
> and to failure if there is no record, but this has to be checked, otherwise
> an addional processor will do the job to check the result of the request).
> 5. & 6. PutMongo + PutFile (if local folder) + PutSolr (if you want to do
> Solr by yourself).
>
> Depending on the details, this could be slightly different, but I think it
> gives a good idea of the minimal set of processors you would need.
>
> HTH,
> Pierre
>
>
> 2016-04-28 16:54 GMT+02:00 Susheel Kumar :
>
>> Hi,
>>
>> After attending meetup in NYC, I am realizing NiFi can be used for the
>> data flow use case I have.  Can someone please share the steps/processors
>> necessary for below use case.
>>
>>
>>1. Receive JSON on a HTTP REST end point
>>2. Parse Http Header and do validation. Return Error code & messages
>>as JSON to the response in case of validation failures
>>3. Parse request JSON, perform various validations (missing data in
>>fields), massages some data, add some data
>>4. Check if the request JSON unique ID is present in MongoDB and
>>compare timestamp to validate if this is an update request or a new 
>> request
>>5. If new request, an entry is made in mongo and then JSON files are
>>written to output folder for another process to pick up and submit to 
>> Solr.
>>6. If update request, mongo record is updated and JSON files are
>>written to output folder
>>
>>
>> I understand that something like HandleHttpRequest Processor can be used
>> for receiving http request and then use PutSolrContentStream for writing to
>> Solr but not clear on what processors will be used for validation etc.
>> steps 2 thru 5 above.
>>
>> Appreciate your input.
>>
>> Thanks,
>> Susheel
>>
>>
>>
>>
>>
>


Re: Spark & NiFi question

2016-05-20 Thread Bryan Bende
Hi Conrad,

Sorry this has been so challenging to set up. After trying it out myself, I
believe the problem you ran into when you didn't set the System properties
is actually a legit bug in the SiteToSiteClient...
I wrote it up in this JIRA [1], but the short answer is that it never uses
those properties to create an SSLContext and ends up trying to make a
normal connection to the https end-point, and thus ends up failing.

I made some quick code changes to work around the above issue, and
eventually got it working using Storm, since I don't have spark streaming
setup. Here is what I did...

In conf/nifi.properties I set the following:

# Site to Site properties
nifi.remote.input.socket.host=
nifi.remote.input.socket.port=8088
nifi.remote.input.secure=true

# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=8080
nifi.web.https.host=
nifi.web.https.port=8443
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200

# security properties #
nifi.sensitive.props.key=
nifi.sensitive.props.algorithm=PBEWITHMD5AND256BITAES-CBC-OPENSSL
nifi.sensitive.props.provider=BC

nifi.security.keystore=/path/to/nifi/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/localhost-ks.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=localtest
nifi.security.keyPasswd=localtest
nifi.security.truststore=/path/to/nifi//nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/localhost-ts.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=localtest


I started NiFi and used the unsecure url (http://localhost:8080/nifi)  to
create a flow with GenerateFlowFile -> Output Port named "Data for Storm".

There is an example Storm topology that is part of the code base [2], so I
started with that, and modified the SiteToSiteClientConfig:

final SiteToSiteClientConfig inputConfig = new SiteToSiteClient.Builder()
.url("https://localhost:8443/nifi;)
.portName("Data for Storm")

.keystoreFilename("/path/to/nifi//nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/localhost-ks.jks")
.keystoreType(KeystoreType.JKS)
.keystorePass("localtest")

.truststoreFilename("/path/to/nifi//nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/localhost-ts.jks")
.truststoreType(KeystoreType.JKS)
.truststorePass("localtest")
.buildConfig();

Now of course setting those properties only worked because of local changes
I made, but after that I got a 401 Unauthorized when I ran the topology,
which I think was where you were originally at.

I went back into the unsecure url and checked the users section and didn't
see anything, so I think I was incorrect that it automatically creates a
pending account.
I then put that localhost cert into my browser (I already had it as p12
from something else) and I went to https://localhost:8443/nifi and it
prompted for the account request and I submitted it.
Went back to the unsecure UI and approved the account with role NiFi, then
went to the Output Port and gave access to the localhost user.

After that it was working... I think since you were already at the point of
getting the 401, if you can just get the account created for that
certificate and the access controls on the ports, then it should probably
work using the System properties as a work around for now, but not totally
sure.

Again, sorry for all the confusion, definitely planning to address the JIRA
soon.

-Bryan

[1] https://issues.apache.org/jira/browse/NIFI-1907
[2]
https://github.com/apache/nifi/blob/e12a79ea929a222a93fd64bfc63382441e31060f/nifi-external/nifi-storm-spout/src/test/java/org/apache/nifi/storm/NiFiStormTopology.java


On Fri, May 20, 2016 at 4:16 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Thanks for the pointers Bryan, however wrt your first suggestion. I tried
> without setting SSL properties on System properties and get an unable to
> find ssl path error – this gets resolved by doing as I have done (but of
> course this may be a red herring). I initially tried setting on site
> builder but got the same error as below – it appears to make no difference
> as to what is logged in the nifi-users.log if I include SSL props on site
> builder or not, I get the same error viz:
>
> 2016-05-20 08:59:47,082 INFO [NiFi Web Server-29590180]
> o.a.n.w.s.NiFiAuthenticationFilter Attempting request for
> 

Re: Spark & NiFi question

2016-05-23 Thread Bryan Bende
Conrad,

I think the error message is misleading a little bit; it says...

"Unable to communicate with yarn-cm1.mis-cds.local:9870 because it requires
Secure Site-to-Site communications, but this instance is not configured for
secure communications"

That statement is saying that your NiFi cluster is configured for secure
site-to-site (which you proved from the debug logs), but that "this
instance" which is actually your Spark streaming job, is not configured for
secure communication.
The reason it thinks your Spark streaming job is not configured for secure
communication is because of the bug I mentioned in the previous email,
where it will never create the SSLContext.

The error message was originally written in the context of two NiFi
instances talking to each other, so it makes more sense in that context.
Perhaps it should be changed to... "this site-to-site client is not
configured for secure communication".

-Bryan


On Mon, May 23, 2016 at 11:04 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Hi,
> I don’t know if I’m hitting some bug here but something doesn’t make sense.
> With ssl debug on I get the following
> NiFi Receiver, READ: TLSv1.2 Application Data, length = 1648
> Padded plaintext after DECRYPTION:  len = 1648
> : 65 A2 B8 34 DF 20 6B 95   56 88 97 16 7A EC 8F E3  e..4. k.V...z...
> 0010: 48 54 54 50 2F 31 2E 31   20 32 30 30 20 4F 4B 0D  HTTP/1.1 200 OK.
> 0020: 0A 44 61 74 65 3A 20 4D   6F 6E 2C 20 32 33 20 4D  .Date: Mon, 23 M
> 0030: 61 79 20 32 30 31 36 20   31 34 3A 34 39 3A 33 39  ay 2016 14:49:39
> 0040: 20 47 4D 54 0D 0A 53 65   72 76 65 72 3A 20 4A 65   GMT..Server: Je
> 0050: 74 74 79 28 39 2E 32 2E   31 31 2E 76 32 30 31 35  tty(9.2.11.v2015
> 0060: 30 35 32 39 29 0D 0A 43   61 63 68 65 2D 43 6F 6E  0529)..Cache-Con
> 0070: 74 72 6F 6C 3A 20 70 72   69 76 61 74 65 2C 20 6E  trol: private, n
> 0080: 6F 2D 63 61 63 68 65 2C   20 6E 6F 2D 73 74 6F 72  o-cache, no-stor
> 0090: 65 2C 20 6E 6F 2D 74 72   61 6E 73 66 6F 72 6D 0D  e, no-transform.
> 00A0: 0A 56 61 72 79 3A 20 41   63 63 65 70 74 2D 45 6E  .Vary: Accept-En
> 00B0: 63 6F 64 69 6E 67 2C 20   55 73 65 72 2D 41 67 65  coding, User-Age
> 00C0: 6E 74 0D 0A 44 61 74 65   3A 20 4D 6F 6E 2C 20 32  nt..Date: Mon, 2
> 00D0: 33 20 4D 61 79 20 32 30   31 36 20 31 34 3A 34 39  3 May 2016 14:49
> 00E0: 3A 33 39 20 47 4D 54 0D   0A 43 6F 6E 74 65 6E 74  :39 GMT..Content
> 00F0: 2D 54 79 70 65 3A 20 61   70 70 6C 69 63 61 74 69  -Type: applicati
> 0100: 6F 6E 2F 6A 73 6F 6E 0D   0A 56 61 72 79 3A 20 41  on/json..Vary: A
> 0110: 63 63 65 70 74 2D 45 6E   63 6F 64 69 6E 67 2C 20  ccept-Encoding,
> 0120: 55 73 65 72 2D 41 67 65   6E 74 0D 0A 43 6F 6E 74  User-Agent..Cont
> 0130: 65 6E 74 2D 4C 65 6E 67   74 68 3A 20 31 32 38 35  ent-Length: 1285
> 0140: 0D 0A 0D 0A 7B 22 72 65   76 69 73 69 6F 6E 22 3A  ."revision":
> 0150: 7B 22 63 6C 69 65 6E 74   49 64 22 3A 22 39 34 38  ."clientId":"948
> 0160: 66 62 34 31 33 2D 65 39   37 64 2D 34 32 37 65 2D  fb413-e97d-427e-
> 0170: 61 34 38 36 2D 31 31 63   39 65 37 31 63 63 62 62  a486-11c9e71ccbb
> 0180: 32 22 7D 2C 22 63 6F 6E   74 72 6F 6C 6C 65 72 22  2".,"controller"
> 0190: 3A 7B 22 69 64 22 3A 22   31 38 63 38 39 64 32 33  :."id":"18c89d23
> 01A0: 2D 61 35 31 65 2D 34 35   35 38 2D 62 30 31 61 2D  -a51e-4558-b01a-
> 01B0: 33 66 36 30 64 66 31 31   63 39 61 64 22 2C 22 6E  3f60df11c9ad","n
> 01C0: 61 6D 65 22 3A 22 4E 69   46 69 20 46 6C 6F 77 22  ame":"NiFi Flow"
> 01D0: 2C 22 63 6F 6D 6D 65 6E   74 73 22 3A 22 22 2C 22  ,"comments":"","
> 01E0: 72 75 6E 6E 69 6E 67 43   6F 75 6E 74 22 3A 31 36  runningCount":16
> 01F0: 34 2C 22 73 74 6F 70 70   65 64 43 6F 75 6E 74 22  4,"stoppedCount"
> 0200: 3A 34 33 2C 22 69 6E 76   61 6C 69 64 43 6F 75 6E  :43,"invalidCoun
> 0210: 74 22 3A 31 2C 22 64 69   73 61 62 6C 65 64 43 6F  t":1,"disabledCo
> 0220: 75 6E 74 22 3A 30 2C 22   69 6E 70 75 74 50 6F 72  unt":0,"inputPor
> 0230: 74 43 6F 75 6E 74 22 3A   37 2C 22 6F 75 74 70 75  tCount":7,"outpu
> 0240: 74 50 6F 72 74 43 6F 75   6E 74 22 3A 31 2C 22 72  tPortCount":1,"r
> 0250: 65 6D 6F 74 65 53 69 74   65 4C 69 73 74 65 6E 69  emoteSiteListeni
> 0260: 6E 67 50 6F 72 74 22 3A   39 38 37 30 2C 22 73 69  ngPort":9870,"si
> 0270: 74 65 54 6F 53 69 74 65   53 65 63 75 72 65 22 3A  teToSiteSecure":
> 0280: 74 72 75 65 2C 22 69 6E   73 74 61 6E 63 65 49 64  true,"instanceId
> 0290: 22 3A 22 30 35 38 30 63   35 31 38 2D 39 62 63 37  ":"0580c518-9bc7
> 02A0: 2D 34 37 38 33 2D 39 32   34 38 2D 35 38 30 61 36  -4783-9248-580a6
> 02B0: 37 34 65 34 33 35 62 22   2C 22 69 6E 70 75 74 50  74e435b","inputP
> 02C0: 6F 72 74 73 22 3A 5B 7B   22 69 64 22 3A 22 33 32  orts":[."id":"32
> 02D0: 37 30 39 33 31 66 2D 64   61 38 35 2D 34 63 34 65  70931f-da85-4c4e
> 02E0: 2D 62 61 65 36 2D 38 63   36 32 37 62 30 39 62 37  -bae6-8c627b09b7
> 02F0: 32 66 22 2C 22 6E 61 6D   65 22 3A 22 48 44 46 53  2f","name":"HDFS
> 0300: 49 6E 63 6F 6D 69 6E 67   22 2C 22 63 6F 6D 6D 65  Incoming","comme
> 0310: 6E 

Re: Spark & NiFi question

2016-05-19 Thread Bryan Bende
Hi Conrad,

I think there are a couple of things at play here...

One is that the SSL properties need to be set on the
SiteToSiteClientBuilder, rather than through system properties. There
should be methods to set the keystore and other values.

In a secured NiFi instance, the certificate you are authenticating with
(the keystore used by the s2s client) would need to have an account in
NiFi, and would need to have access to the output port.
If you attempt to make a request with that cert, and then you go into the
NiFi UI as another user, you should be able to go into the accounts section
(top right) and approve the account for that certificate.

Then if you stop your output port, right-click and Configure..., and from
the Access Controls tab start typing the DN from your cert and add that
user to the Allowed Users list. Hit Apply and start the port again.

We probably need to document this better, or write up an article about it
somewhere.

Let us know if its still not working.

Thanks,

Bryan


On Thu, May 19, 2016 at 11:54 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Hi,
> Tried following a couple of blog posts about this [1], [2], but neither of
> these refer to using NiFi in clustered environment with SSL and I suspect
> this is where I am hitting problems (but don’t know where).
>
> The blogs state that using an output port (in the root process group I.e.
> on main canvas) which I have done and tried to connect thus..
>
> System.setProperty("javax.net.ssl.keyStore", "/spark-processor.jks");
> System.setProperty("javax.net.ssl.keyStorePassword", *“**");
> System.setProperty("javax.net.ssl.trustStore", *“*/cacerts.jks");
>
> SiteToSiteClientConfig config = new SiteToSiteClient.Builder()
> .url("https://yarn-cm1.mis-cds.local:9090/nifi;)
> .portName("Spark test out")
> .buildConfig();
>
> SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("NiFi 
> Spark Log Processor");
> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, new 
> Duration(5000));
> JavaReceiverInputDStream packetStream = 
> jssc.receiverStream(new NiFiReceiver(config, StorageLevel.MEMORY_ONLY()));
>
> JavaDStream text = packetStream.map(dataPacket -> new 
> String(dataPacket.getContent(), StandardCharsets.UTF_8));
> text.print();
> jssc.start();
> jssc.awaitTermination();
>
> The error I am getting is
>
> 16/05/19 16:39:03 WARN ReceiverSupervisorImpl: Restarting receiver with
> delay 2000 ms: Failed to receive data from NiFi
> java.io.IOException: Server returned HTTP response code: 401 for URL:
> https://yarn-cm1.mis-cds.local:9090/nifi-api/controller
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at
> sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1889)
> at
> sun.net.www.protocol.http.HttpURLConnection$10.run(HttpURLConnection.java:1884)
> at java.security.AccessController.doPrivileged(Native Method)
> at
> sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1883)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1456)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
> at
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
> at
> org.apache.nifi.remote.util.NiFiRestApiUtil.getController(NiFiRestApiUtil.java:69)
> at
> org.apache.nifi.remote.client.socket.EndpointConnectionPool.refreshRemoteInfo(EndpointConnectionPool.java:891)
> at
> org.apache.nifi.remote.client.socket.EndpointConnectionPool.getPortIdentifier(EndpointConnectionPool.java:878)
> at
> org.apache.nifi.remote.client.socket.EndpointConnectionPool.getOutputPortIdentifier(EndpointConnectionPool.java:862)
> at
> org.apache.nifi.remote.client.socket.SocketClient.getPortIdentifier(SocketClient.java:81)
> at
> org.apache.nifi.remote.client.socket.SocketClient.createTransaction(SocketClient.java:123)
> at
> org.apache.nifi.spark.NiFiReceiver$ReceiveRunnable.run(NiFiReceiver.java:149)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Server returned HTTP response code: 401
> for URL: https://yarn-cm1.mis-cds.local:9090/nifi-api/controller
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1839)
> at
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1440)
> at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
> at
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
> at
> 

Re: Spark & NiFi question

2016-05-23 Thread Bryan Bende
Conrad,

Unfortunately I think this is a result of the issue you discovered with the
SSLContext not getting created from the properties on the
SiteToSiteClientBuilder...

Whats happening is the spark side is hitting this:

if (siteToSiteSecure) {
    if (sslContext == null) {
        throw new IOException("Unable to communicate with " + hostname + ":" + port
            + " because it requires Secure Site-to-Site communications, but this instance is not configured for secure communications");
    }
}

And siteToSiteSecure is true, but the sslContext is null so it can never
get past this point. I submitted a pull request on Friday that should
address the issue [1].

Once we get this merged in you could possibly build the source to get the
fixed SiteToSiteClient code, otherwise you could wait for the 0.7.0 release
to happen.

-Bryan

[1] https://github.com/apache/nifi/pull/457

On Mon, May 23, 2016 at 5:39 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Hi,
> An update to this but still not working
> I have now set keystore and truststore as system properties, and included
> these as part of the SiteToSiteClientConfig building. I have used a cert
> that I have for one of the servers in my cluster as I know they can
> communicate over ssl with NCM as my 6 node cluster works over ssl and has
> remote ports working (as I read from syslog on a primary server then
> distribute to other via remote ports as suggested somewhere else) .
> When I try now to connect to output port via Spark, I get a
> "EndpointConnectionPool[Cluster URL=
> https://yarn-cm1.mis-cds.local:9090/nifi/] Unable to refresh Remote
> Group's peers due to java.io.IOException: Unable to communicate with
> yarn-cm1.mis-cds.local:9870 because it requires Secure Site-to-Site
> communications, but this instance is not configured for secure
> communications"
> Exception even though I know Secure Site-to-Site communication is working
> (9870 being the port set up for remote s2s comms in nifi.properties), so I
> am now really confused!!
>
> Does the port that I wish to read from need to be set up with remote
> process group (conceptually I’m struggling with how to do this for an
> output port), or is it sufficient to be ‘just an output port’?
>
> I have this working when connecting to an unsecured (http) instance of
> NiFi running on my laptop with Spark and a standard output port. Does it
> make a difference that my production cluster is a cluster and therefore
> needs setting up differently?
>
> So many questions but I’m stuck now so any suggestions welcome.
> Thanks
> Conrad
>
> From: Conrad Crampton 
> Reply-To: "users@nifi.apache.org" 
> Date: Friday, 20 May 2016 at 09:16
> To: "users@nifi.apache.org" 
> Subject: SPOOFED: Re: Spark & NiFi question
>
> Thanks for the pointers Bryan, however wrt your first suggestion. I tried
> without setting SSL properties on System properties and get an unable to
> find ssl path error – this gets resolved by doing as I have done (but of
> course this may be a red herring). I initially tried setting on site
> builder but got the same error as below – it appears to make no difference
> as to what is logged in the nifi-users.log if I include SSL props on site
> builder or not, I get the same error viz:
>
> 2016-05-20 08:59:47,082 INFO [NiFi Web Server-29590180]
> o.a.n.w.s.NiFiAuthenticationFilter Attempting request for
> 

Re: Build a CSV file using MergeContent processor

2016-05-12 Thread Bryan Bende
When using the "text" strategy you may have to do shift+enter in the
demarcator field to create a new line.

On Thu, May 12, 2016 at 3:52 PM, Joe Witt  wrote:

> Igor,
>
> I believe it will encode whatever you give it in UTF-8 and place those
> bytes in.  For absolute control over the content of the demarcator
> that gets injected between each merged thing use the delimiter
> strategy of 'filename' and point at a file containing precisely the
> bytes you want.
>
> Thanks
> Joe
>
> On Thu, May 12, 2016 at 3:49 PM, Igor Kravzov 
> wrote:
> > Joe, If I put \n or '\n' the processor adds it as a string. How do I add
> > it as ASCII?
> >
> > On Thu, May 12, 2016 at 2:43 PM, Joe Witt  wrote:
> >>
> >> Igor,
> >>
> >> MergeContent [1] has a property for this purpose called "Demarcator"
> >> and you can set the "Delimiter Strategy" to "text" and put a value for
> >> the demarcator of \n.
> >>
> >> That should get you there I think.
> >>
> >> [1]
> >>
> https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.MergeContent/index.html
> >>
> >> Thanks
> >> Joe
> >>
> >> On Thu, May 12, 2016 at 2:40 PM, Igor Kravzov 
> >> wrote:
> >> > I have a workflow where an EvaluateJson PR is used to extract some values,
> >> > and a ReplaceText PR is used to create a comma-delimited line. Now I want to
> >> > create a CSV file from these lines. Currently I am using the MergeContent PR,
> >> > but it concatenates the result lines instead of placing each on a new line.
> >> > Should I just create a demarcator file with a new line in it? Or are there
> >> > other options?
> >> >
> >> > Thanks in advance.
> >
> >
>


Re: Json Split

2016-05-17 Thread Bryan Bende
Hello,

I think this would probably be better handled by SplitText with a line
count of 1.
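
A rough sketch of the SplitText configuration (property names as they appear
in the processor, worth double-checking in your version):

    Line Split Count: 1
    Header Line Count: 0
    Remove Trailing Newlines: true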

SplitJson would be more for splitting an array of JSON documents, or a
field that is an array.

-Bryan

On Tue, May 17, 2016 at 12:15 PM, Madhukar Thota 
wrote:

> I have an incoming json from kafka with two documents separated by a new line
>
> {"index":{"_index":"mylogger-2014.06.05","_type":"mytype-host.domain.com"}}{"json":"data","extracted":"from","message":"payload"}
>
>
> I want to get the second document after the new line. How can I split the
> json by new line using the SplitJson processor?
>


Re: Json Split

2016-05-17 Thread Bryan Bende
If you only want the second JSON document, can you send the output of
SplitText to EvaluateJsonPath and configure it to extract $.json ?
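
A minimal configuration sketch for that EvaluateJsonPath (the attribute name
"json" is just illustrative):

    Destination: flowfile-attribute
    Return Type: auto-detect
    (dynamic property) json: $.json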

In your original example only the second document had a field called
"json", and the matched relationship coming out of EvaluateJsonPath will
only receive the json documents that had the path being extracted.

-Bryan


On Tue, May 17, 2016 at 1:52 PM, Madhukar Thota <madhukar.th...@gmail.com>
wrote:

> How do i get  entry-3: {"json":"data","extracted":"from","message":
> "payload"} only?
>
> On Tue, May 17, 2016 at 1:52 PM, Madhukar Thota <madhukar.th...@gmail.com>
> wrote:
>
>> Hi Andrew,
>>
>> I configured as you suggested, but in the queue i see three entries..
>>
>>
>> entry-1: {"index":{"_index":"mylogger-2014.06.05","_type":"
>> mytype-host.domain.com"}}
>> {"json":"data","extracted":"from","message":"payload"}
>>
>> entry-2: {"index":{"_index":"mylogger-2014.06.05","_type":"
>> mytype-host.domain.com"}}
>>
>> entry-3: {"json":"data","extracted":"from","message":"payload"}
>>
>>
>>
>>
>>
>> On Tue, May 17, 2016 at 1:29 PM, Andrew Grande <agra...@hortonworks.com>
>> wrote:
>>
>>> Try SplitText with a header line count of 1. It should skip it and give
>>> the 2nd line as a result.
>>>
>>> Andrew
>>>
>>> From: Madhukar Thota <madhukar.th...@gmail.com>
>>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Date: Tuesday, May 17, 2016 at 12:31 PM
>>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>>> Subject: Re: Json Split
>>>
>>> Hi Bryan,
>>>
>>> I tried with lineCount 1, i see it splitting two documents. But i need
>>> to only one document
>>>
>>> "{"json":"data","extracted":"from","message":"payload"}"
>>>
>>> How can i get that?
>>>
>>> On Tue, May 17, 2016 at 12:21 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> I think this would probably be better handled by SplitText with a line
>>>> count of 1.
>>>>
>>>> SplitJson would be more for splitting an array of JSON documents, or a
>>>> field that is an array.
>>>>
>>>> -Bryan
>>>>
>>>> On Tue, May 17, 2016 at 12:15 PM, Madhukar Thota <
>>>> madhukar.th...@gmail.com> wrote:
>>>>
>>>>> I have a incoming json from kafka with two documents seperated by new
>>>>> line
>>>>>
>>>>> {"index":{"_index":"mylogger-2014.06.05","_type":"mytype-host.domain.com"}}{"json":"data","extracted":"from","message":"payload"}
>>>>>
>>>>>
>>>>> I want to get the second document after new line. How can i split the
>>>>> json by new line using SplitJSOn processor.
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Json Split

2016-05-17 Thread Bryan Bende
I think another alternative could be to use RouteText...

If you set the Matching Strategy to "starts with" and add a dynamic
property called "matched" with a value of {"json, then any lines that
start with {"json will be sent to the matched relationship.

On Tue, May 17, 2016 at 3:08 PM, Bryan Bende <bbe...@gmail.com> wrote:

> If you only want the second JSON document, can you send the output of
> SplitText to EvaluateJsonPath and configure it to extract $.json ?
>
> In your original example only the second document had a field called
> "json", and the matched relationship coming out of EvaluateJsonPath will
> only receive the json documents that had the path being extracted.
>
> -Bryan
>
>
> On Tue, May 17, 2016 at 1:52 PM, Madhukar Thota <madhukar.th...@gmail.com>
> wrote:
>
>> How do i get  entry-3: {"json":"data","extracted":"from","message":
>> "payload"} only?
>>
>> On Tue, May 17, 2016 at 1:52 PM, Madhukar Thota <madhukar.th...@gmail.com
>> > wrote:
>>
>>> Hi Andrew,
>>>
>>> I configured as you suggested, but in the queue i see three entries..
>>>
>>>
>>> entry-1: {"index":{"_index":"mylogger-2014.06.05","_type":"
>>> mytype-host.domain.com"}}
>>> {"json":"data","extracted":"from","message":"payload"}
>>>
>>> entry-2: {"index":{"_index":"mylogger-2014.06.05","_type":"
>>> mytype-host.domain.com"}}
>>>
>>> entry-3: {"json":"data","extracted":"from","message":"payload"}
>>>
>>>
>>>
>>>
>>>
>>> On Tue, May 17, 2016 at 1:29 PM, Andrew Grande <agra...@hortonworks.com>
>>> wrote:
>>>
>>>> Try SplitText with a header line count of 1. It should skip it and give
>>>> the 2nd line as a result.
>>>>
>>>> Andrew
>>>>
>>>> From: Madhukar Thota <madhukar.th...@gmail.com>
>>>> Reply-To: "users@nifi.apache.org" <users@nifi.apache.org>
>>>> Date: Tuesday, May 17, 2016 at 12:31 PM
>>>> To: "users@nifi.apache.org" <users@nifi.apache.org>
>>>> Subject: Re: Json Split
>>>>
>>>> Hi Bryan,
>>>>
>>>> I tried with lineCount 1, i see it splitting two documents. But i need
>>>> to only one document
>>>>
>>>> "{"json":"data","extracted":"from","message":"payload"}"
>>>>
>>>> How can i get that?
>>>>
>>>> On Tue, May 17, 2016 at 12:21 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I think this would probably be better handled by SplitText with a line
>>>>> count of 1.
>>>>>
>>>>> SplitJson would be more for splitting an array of JSON documents, or a
>>>>> field that is an array.
>>>>>
>>>>> -Bryan
>>>>>
>>>>> On Tue, May 17, 2016 at 12:15 PM, Madhukar Thota <
>>>>> madhukar.th...@gmail.com> wrote:
>>>>>
>>>>>> I have a incoming json from kafka with two documents seperated by new
>>>>>> line
>>>>>>
>>>>>> {"index":{"_index":"mylogger-2014.06.05","_type":"mytype-host.domain.com"}}{"json":"data","extracted":"from","message":"payload"}
>>>>>>
>>>>>>
>>>>>> I want to get the second document after new line. How can i split the
>>>>>> json by new line using SplitJSOn processor.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Help Wanted

2016-05-03 Thread Bryan Bende
They are treated with the same priority, but as Oleg mentioned, PRs do make
collaborative review easier and have built-in integration with Travis,
although there are currently some issues getting it to work consistently.

On Tue, May 3, 2016 at 11:26 AM, Suneel Marthi  wrote:

> PR is the standard now across most Apache projects.
>
> On Tue, May 3, 2016 at 11:25 AM, Oleg Zhurakousky <
> ozhurakou...@hortonworks.com> wrote:
>
>> Andrew
>>
>> Regarding PR vs. Patch.
>>
>> This has been an ongoing discussion and i’ll let other’s to contribute to
>> this. Basically we support both. That said, personally (and it appears to
>> be embraced by the rest of the community) PR is the preference specifically
>> due to the inline review/comment capabilities provided by GitHub.
>>
>> Cheers
>> Oleg
>>
>> > On May 3, 2016, at 11:18 AM, Andrew Psaltis 
>> wrote:
>> >
>> > Thank you Oleg!
>> >
>> > Yeah, that page with the Code Review, has a little refresh link, but it
>> > really just points to this JIRA query:
>> > https://issues.apache.org/jira/browse/NIFI-1837?filter=12331874
>> >
>> > As a community is there a preference given to JIRA's with Patch or GH
>> PR's
>> > or are they all treated with the same priority?
>> >
>> > Thanks,
>> > Andrew
>> >
>> > On Tue, May 3, 2016 at 11:12 AM, Oleg Zhurakousky <
>> > ozhurakou...@hortonworks.com> wrote:
>> >
>> >> Andrew
>> >>
>> >> Thank you so much for following up on this.
>> >> I am assuming you have GitHub account. If not please create one as
>> most of
>> >> our contributions deal with pull requests (PR).
>> >> Then you can go to https://github.com/apache/nifi , click on “Pull
>> >> Requests” and review them by commenting in line (you can see plenty of
>> >> examples there of PRs that are already in review process).
>> >>
>> >> I would also suggest to get familiar with Contributor’s guideline for
>> NiFi
>> >> - https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide.
>> But
>> >> it appears you have already done so and I think there may be small
>> >> discrepancy in the link you provided or may be it is not as dynamic.
>> >> In any event JIRA and GutHub are good resources to use.
>> >>
>> >> As for the last question, the best case scenario is both (code review
>> and
>> >> test). Having said that we do realize that your time and the time of
>> every
>> >> contributor may be limited, so I say whatever you can. Some time quick
>> code
>> >> scan can uncover the obvious that doesn’t need testing.
>> >>
>> >> Thanks again
>> >> Cheers
>> >> Oleg
>> >>
>> >> On May 3, 2016, at 11:07 AM, Andrew Psaltis 
>> >> wrote:
>> >>
>> >> Oleg,
>> >> I would love to help -- couple of quick questions:
>> >>
>> >> The GH PR's are ~60 as you indicated, but the How To Contribute guide
>> (Code
>> >> review process --
>> >>
>> >>
>> https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-CodeReviewProcess
>> >> ) shows a JIRA list with patches available.
>> >>
>> >> Which should be reviewed first? For the PR's on GH are you just
>> looking for
>> >> code review or same process of apply local merge and test?
>> >>
>> >> Thanks,
>> >> Andrew
>> >>
>> >> On 5/3/16, 9:58 AM, "Oleg Zhurakousky" 
>> >> wrote:
>> >>
>> >> Guys
>> >>
>> >> I’d like to use this opportunity to address all members of the NiFi
>> >>
>> >> community hence this email is sent to both mailing lists (dev/users)
>> >>
>> >>
>> >> While somewhat skeptical when I started 6 month ago, I have to admit
>> that
>> >>
>> >> now I am very excited to observe the growth and adaption of the Apache
>> NiFi
>> >> and say that in large part it’s because of the healthy community that
>> we
>> >> have here - committers and contributors alike representing variety of
>> >> business domains.
>> >>
>> >> This is absolutely great news for all of us and I am sure some if not
>> all
>> >>
>> >> of you share this sentiment.
>> >>
>> >>
>> >> That said and FWIW we need help!
>> >> While it’s great to wake up every morning to a set of new PRs and
>> patches,
>> >>
>> >> we now have a bit of a back log. In large this is due to the fact that
>> most
>> >> of our efforts are spent in development as we all try to grow NiFi
>> feature
>> >> base. However we need to remember that PRs and patches will remain as
>> they
>> >> are unless and until they are reviewed/agreed to be merged by this same
>> >> community and that is where we need help. While “merge"
>> responsibilities
>> >> are limited to “committers”, “review” is the responsibility of every
>> member
>> >> of this community and I would like to ask you if at all possible to
>> >> redirect some of your efforts to this process.
>> >>
>> >> We currently have 61 outstanding PRs and this particular development
>> cycle
>> >>
>> >> is a bit more complex then the previous ones since it addresses 0.7.0
>> and
>> >> 1.0.0 releases in parallel (so different approach to 

Re: Help Wanted

2016-05-03 Thread Bryan Bende
The "Patch Available" state in JIRA can mean a patch is attached to the
JIRA, or a PR is submitted.

It is really just a manual state transition on the ticket: after
In-Progress, the next state is Patch Available, which tells people there
is something to review.

On Tue, May 3, 2016 at 11:29 AM, Andrew Psaltis <psaltis.and...@gmail.com>
wrote:

> Totally agree on all fronts. Would seem like it makes sense for a
> documentation PR to be opened soon with updates to the
> https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide#ContributorGuide-CodeReviewProcess
> page to remove the ambiguity.
>
>
> On Tue, May 3, 2016 at 11:27 AM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> They are treated with same priority, but as Oleg mentioned, the PRs do
>> make it easier for collaborative review and has the built in integration
>> with Travis, although currently some issues to get it consistently working.
>>
>> On Tue, May 3, 2016 at 11:26 AM, Suneel Marthi <smar...@apache.org>
>> wrote:
>>
>>> PR is the standard now across most Apache projects.
>>>
>>> On Tue, May 3, 2016 at 11:25 AM, Oleg Zhurakousky <
>>> ozhurakou...@hortonworks.com> wrote:
>>>
>>>> Andrew
>>>>
>>>> Regarding PR vs. Patch.
>>>>
>>>> This has been an ongoing discussion and i’ll let other’s to contribute
>>>> to this. Basically we support both. That said, personally (and it appears
>>>> to be embraced by the rest of the community) PR is the preference
>>>> specifically due to the inline review/comment capabilities provided by
>>>> GitHub.
>>>>
>>>> Cheers
>>>> Oleg
>>>>
>>>> > On May 3, 2016, at 11:18 AM, Andrew Psaltis <psaltis.and...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Thank you Oleg!
>>>> >
>>>> > Yeah, that page with the Code Review, has a little refresh link, but
>>>> it
>>>> > really just points to this JIRA query:
>>>> > https://issues.apache.org/jira/browse/NIFI-1837?filter=12331874
>>>> >
>>>> > As a community is there a preference given to JIRA's with Patch or GH
>>>> PR's
>>>> > or are they all treated with the same priority?
>>>> >
>>>> > Thanks,
>>>> > Andrew
>>>> >
>>>> > On Tue, May 3, 2016 at 11:12 AM, Oleg Zhurakousky <
>>>> > ozhurakou...@hortonworks.com> wrote:
>>>> >
>>>> >> Andrew
>>>> >>
>>>> >> Thank you so much for following up on this.
>>>> >> I am assuming you have GitHub account. If not please create one as
>>>> most of
>>>> >> our contributions deal with pull requests (PR).
>>>> >> Then you can go to https://github.com/apache/nifi , click on “Pull
>>>> >> Requests” and review them by commenting in line (you can see plenty
>>>> of
>>>> >> examples there of PRs that are already in review process).
>>>> >>
>>>> >> I would also suggest to get familiar with Contributor’s guideline
>>>> for NiFi
>>>> >> - https://cwiki.apache.org/confluence/display/NIFI/Contributor+Guide.
>>>> But
>>>> >> it appears you have already done so and I think there may be small
>>>> >> discrepancy in the link you provided or may be it is not as dynamic.
>>>> >> In any event JIRA and GutHub are good resources to use.
>>>> >>
>>>> >> As for the last question, the best case scenario is both (code
>>>> review and
>>>> >> test). Having said that we do realize that your time and the time of
>>>> every
>>>> >> contributor may be limited, so I say whatever you can. Some time
>>>> quick code
>>>> >> scan can uncover the obvious that doesn’t need testing.
>>>> >>
>>>> >> Thanks again
>>>> >> Cheers
>>>> >> Oleg
>>>> >>
>>>> >> On May 3, 2016, at 11:07 AM, Andrew Psaltis <
>>>> psaltis.and...@gmail.com>
>>>> >> wrote:
>>>> >>
>>>> >> Oleg,
>>>> >> I would love to help -- couple of quick questions:
>>>> >>
>>>> >> The GH PR's are ~60 as you indicated, but the How To Contribute
>>>> guide (Code
>&

Re: Failed to receive data from Hbase due to java net connection exception :Connection Refused in Nifi

2016-05-06 Thread Bryan Bende
rsistentProvenanceRepository Created new Provenance Event Writers
> for events starting with ID 128123
> 2016-05-06 20:03:16,584 INFO [Provenance Repository Rollover Thread-1]
> o.a.n.p.PersistentProvenanceRepository Successfully merged 16 journal files
> (4449 records) into single Provenance Log File
> ./provenance_repository/123674.prov in 698 milliseconds
> 2016-05-06 20:03:16,584 INFO [Provenance Repository Rollover Thread-1]
> o.a.n.p.PersistentProvenanceRepository Successfully Rolled over Provenance
> Event file containing 2770 records
> 2016-05-06 20:03:33,473 INFO [pool-18-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Initiating checkpoint of FlowFile
> Repository
> 2016-05-06 20:03:33,638 INFO [pool-18-thread-1]
> org.wali.MinimalLockingWriteAheadLog
> org.wali.MinimalLockingWriteAheadLog@26650de4 checkpointed with 0 Records
> and 0 Swap Files in 163 milliseconds (Stop-the-world time = 32
> milliseconds, Clear Edit Logs time = 35 millis), max Transaction ID 270109
> 2016-05-06 20:03:33,638 INFO [pool-18-thread-1]
> o.a.n.c.r.WriteAheadFlowFileRepository Successfully checkpointed FlowFile
> Repository with 0 records in 164 milliseconds
> 2016-05-06 20:03:34,057 INFO [FileSystemRepository Workers Thread-2]
> o.a.n.c.repository.FileSystemRepository Successfully archived 1 Resource
> Claims for Container default in 26 millis
>
> On Fri, May 6, 2016 at 7:50 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Ok, thanks for sharing that...
>>
>> Can you look in NIFI_HOME/logs/nifi-app.log and see if you can find that
>> error message that says "Failed to receive data from HBase due to",
>> and then there should be a whole stack trace that goes with that.
>>
>> If you could paste that whole stacktrace here that would be helpful.
>>
>> Thanks,
>>
>> Bryan
>>
>> On Fri, May 6, 2016 at 10:14 AM, Venkatesh Bodapati <
>> venkatesh.bodap...@inndata.in> wrote:
>>
>>> I am running NiFi, Hbase and Zookeeper in same machine.
>>>
>>> This my Hbase-site.xml :
>>> 
>>> 
>>> hbase.master
>>> localhost:6
>>> 
>>> 
>>> hbase.rootdir
>>> hdfs://localhost:8020/hbase
>>>     
>>> 
>>> hbase.cluster.distributed
>>> true
>>> 
>>> 
>>> hbase.zookeeper.quorum
>>> localhost
>>> 
>>> 
>>> dfs.replication
>>> 1
>>> 
>>> 
>>> hbase.zookeeper.property.clientPort
>>> 2181
>>> 
>>> 
>>> hbase.zookeeper.property.dataDir
>>> /usr/local/hadoop/hbase-1.1.2/zookeeper
>>> 
>>> 
>>> zookeeper.znode.parent
>>> /hbase
>>> 
>>> 
>>>
>>> and my Hbase-client Service properties like this :
>>> Hadoop Configuration FilesInfo  :
>>> /usr/local/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml,/usr/local/hadoop/hbase-1.1.2/conf/hbase-site.xml
>>> ZooKeeper QuorumInfo: No value set
>>> ZooKeeper Client PortInfo: No value set
>>> ZooKeeper ZNode ParentInfo   : /hbase
>>> HBase Client RetriesInfo   : 1
>>>
>>> On Fri, May 6, 2016 at 7:13 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> Can you describe your setup a little bit more, things like...
>>>>
>>>> Is NiFi running on the same machine that HBase and ZooKeeper are
>>>> running on (it doesn't have to be, just want to understand the setup) ?
>>>>
>>>> What values do you have in hbase-site.xml for properties like...
>>>> - hbase.zookeeper.quorum
>>>> - hbase.zookeeper.property.clientPort
>>>>
>>>> Here is a template that is working for me:
>>>>
>>>> https://gist.githubusercontent.com/bbende/9457b5ed261e6eeb0f98995a5a2699e0/raw/38e050828c2545cb50e623d6dcf45dbbc9ad1d9d/FunWithHBaseUpdated.xml
>>>>
>>>> I am running NiFi on the HDP Sandbox so everything is local for me.
>>>>
>>>>
>>>> On Fri, May 6, 2016 at 8:54 AM, Venkatesh Bodapati <
>>>> venkatesh.bodap...@inndata.in> wrote:
>>>>
>>>>>   " Missing Row id failure, routing to failure " issue solved . Still
>>>

Re: Cannot Authenticate the Azure Datalake Store

2016-05-06 Thread Bryan Bende
Hello,

It seems like maybe a wrong version of a library is being used, or maybe a
JAR is missing that needs to be included in your NAR.

Can you share what dependencies you have in the pom.xml of your processors
project?

You can check under
NIFI_HOME/work/nar/extensions/<your-nar>/META-INF/bundled-dependencies/ to
see the JARs that are being included in your NAR.

-Bryan


On Fri, May 6, 2016 at 12:52 AM, Kumiko Yada  wrote:

> Hello,
>
>
>
> I’m writing the custom processor to create a file in the Azure Datalake
> Store.  I wrote the sample java program to do using this sample code,
> https://azure.microsoft.com/en-us/documentation/articles/data-lake-store-get-started-java-sdk/.
> This is working fine.  However, when I wrote the code in the custom
> processor, it’s failing if the Authenticate code with this this error when
> this custom processor is running.  I didn’t get any build error when I
> built this custom processor.   Do you have any ideas why I’m getting this
> error?  Is there any file I need to copy to the nifi folder other than nar
> file for the custom processor?
>
>
>
> *2016-05-05 20:37:37,338 ERROR [Timer-Driven Process Thread-4]
> d.p.custom.CreateFileAzureDatalakeStore
> CreateFileAzureDatalakeStore[id=f89e5860-1d99-4d20-be71-769f6bb775c7]
> CreateFileAzureDatalakeStore[id=f89e5860-1d99-4d20-be71-769f6bb775c7]
> failed to process due to java.lang.NoSuchMethodError:
> com.microsoft.azure.AzureServiceClient: method ()V not found; rolling
> back session: java.lang.NoSuchMethodError:
> com.microsoft.azure.AzureServiceClient: method ()V not found*
>
> *2016-05-05 20:37:37,342 INFO [Flow Service Tasks Thread-2]
> o.a.nifi.controller.StandardFlowService Saved flow controller
> org.apache.nifi.controller.FlowController@3dcbc96c // Another save pending
> = false*
>
> *2016-05-05 20:37:37,342 ERROR [Timer-Driven Process Thread-4]
> d.p.custom.CreateFileAzureDatalakeStore *
>
> *java.lang.NoSuchMethodError: com.microsoft.azure.AzureServiceClient:
> method ()V not found*
>
> *at
> com.microsoft.azure.management.datalake.store.DataLakeStoreAccountManagementClientImpl.(DataLakeStoreAccountManagementClientImpl.java:176)
> ~[na:na]*
>
> *at
> com.microsoft.azure.management.datalake.store.DataLakeStoreAccountManagementClientImpl.(DataLakeStoreAccountManagementClientImpl.java:166)
> ~[na:na]*
>
> *at
> dsiq.processors.custom.CreateFileAzureDatalakeStore.SetupClients(CreateFileAzureDatalakeStore.java:286)
> ~[na:na]*
>
> *at
> dsiq.processors.custom.CreateFileAzureDatalakeStore.onTrigger(CreateFileAzureDatalakeStore.java:269)
> ~[na:na]*
>
> *at
> org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
> ~[nifi-api-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]*
>
> *at
> org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1057)
> [nifi-framework-core-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]*
>
> *at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:136)
> [nifi-framework-core-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]*
>
> *at
> org.apache.nifi.controller.tasks.ContinuallyRunProcessorTask.call(ContinuallyRunProcessorTask.java:47)
> [nifi-framework-core-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]*
>
> *at
> org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:123)
> [nifi-framework-core-0.7.0-SNAPSHOT.jar:0.7.0-SNAPSHOT]*
>
> *at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [na:1.8.0_77]*
>
> *at
> java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [na:1.8.0_77]*
>
> *at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [na:1.8.0_77]*
>
> *at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [na:1.8.0_77]*
>
> *at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_77]*
>
> *at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_77]*
>
> *at java.lang.Thread.run(Thread.java:745)
> [na:1.8.0_77]*
>
> *2016-05-05 20:37:37,346 ERROR [Timer-Driven Process Thread-4]
> d.p.custom.CreateFileAzureDatalakeStore
> CreateFileAzureDatalakeStore[id=f89e5860-1d99-4d20-be71-769f6bb775c7]
> CreateFileAzureDatalakeStore[id=f89e5860-1d99-4d20-be71-769f6bb775c7]
> failed to process session due to java.lang.NoSuchMethodError:
> com.microsoft.azure.AzureServiceClient: method ()V not found:
> java.lang.NoSuchMethodError: com.microsoft.azure.AzureServiceClient: method
> ()V not found*
>
> *2016-05-05 20:37:37,347 

Re: Failed to receive data from Hbase due to java net connection exception :Connection Refused in Nifi

2016-05-06 Thread Bryan Bende
Ok, thanks for sharing that...

Can you look in NIFI_HOME/logs/nifi-app.log and see if you can find that
error message that says "Failed to receive data from HBase due to",
and then there should be a whole stack trace that goes with that.

If you could paste that whole stacktrace here that would be helpful.

Thanks,

Bryan

On Fri, May 6, 2016 at 10:14 AM, Venkatesh Bodapati <
venkatesh.bodap...@inndata.in> wrote:

> I am running NiFi, Hbase and Zookeeper in same machine.
>
> This my Hbase-site.xml :
> 
> 
> hbase.master
> localhost:6
> 
> 
> hbase.rootdir
> hdfs://localhost:8020/hbase
> 
> 
> hbase.cluster.distributed
> true
> 
> 
> hbase.zookeeper.quorum
> localhost
> 
> 
> dfs.replication
> 1
> 
> 
> hbase.zookeeper.property.clientPort
> 2181
> 
> 
> hbase.zookeeper.property.dataDir
> /usr/local/hadoop/hbase-1.1.2/zookeeper
> 
> 
> zookeeper.znode.parent
> /hbase
> 
> 
>
> and my Hbase-client Service properties like this :
> Hadoop Configuration FilesInfo  :
> /usr/local/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml,/usr/local/hadoop/hbase-1.1.2/conf/hbase-site.xml
> ZooKeeper QuorumInfo: No value set
> ZooKeeper Client PortInfo    : No value set
> ZooKeeper ZNode ParentInfo   : /hbase
> HBase Client RetriesInfo   : 1
>
> On Fri, May 6, 2016 at 7:13 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Can you describe your setup a little bit more, things like...
>>
>> Is NiFi running on the same machine that HBase and ZooKeeper are running
>> on (it doesn't have to be, just want to understand the setup) ?
>>
>> What values do you have in hbase-site.xml for properties like...
>> - hbase.zookeeper.quorum
>> - hbase.zookeeper.property.clientPort
>>
>> Here is a template that is working for me:
>>
>> https://gist.githubusercontent.com/bbende/9457b5ed261e6eeb0f98995a5a2699e0/raw/38e050828c2545cb50e623d6dcf45dbbc9ad1d9d/FunWithHBaseUpdated.xml
>>
>> I am running NiFi on the HDP Sandbox so everything is local for me.
>>
>>
>> On Fri, May 6, 2016 at 8:54 AM, Venkatesh Bodapati <
>> venkatesh.bodap...@inndata.in> wrote:
>>
>>>   " Missing Row id failure, routing to failure " issue solved . Still i
>>> got the "failed to receive data from Hbase due to java net connection
>>> exception :Connection Refused" error in GetHbase Processor.  I update
>>> Hbase-Site.xml and remove remaining properties like " Zookeeper Quorum
>>> Info,Zookeeper Client Port Info . still i got the same error in GetHbase
>>> processor.
>>>
>>> On Thu, May 5, 2016 at 9:07 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>
>>>> I was looking at this a little more and instead of needing the
>>>> RouteOnAttribute, I think you can just update EvaluateJsonPath 
>>>> properties...
>>>>
>>>> Based on the JSON I see coming in, they should be:
>>>>
>>>> email = $.email
>>>> firstName = $.name.first
>>>> lastName = $.name.last
>>>> ssn = $.id.value
>>>>
>>>>
>>>>
>>>> On Thu, May 5, 2016 at 11:30 AM, Bryan Bende <bbe...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> For HBaseClient, are you sure your ZooKeeper being used by HBase is
>>>>> running on localhost:2181?
>>>>>
>>>>> Typically you don't really need to set the three ZooKeeper properties,
>>>>> and you can instead just set the hbase-site.xml in the config resources.
>>>>>
>>>>> For example, my Hadoop Configuration Resources is set as:
>>>>> /etc/hadoop/conf/core-site.xml,/etc/hbase/conf/hbase-site.xml
>>>>>
>>>>> And then I don't have anything specified for ZooKeeper QuorumInfo,
>>>>> etc., because hbase-site.xml is providing all that information.
>>>>>
>>>>> The example template requires a Users table, so from Hbase shell:
>>>>> create 'Users', {NAME => 'cf'}
>>>>>
>>>>> Then the template has a slight problem where it is using
>>>>> EvaluteJsonPa

Re: Failed to receive data from Hbase due to java net connection exception :Connection Refused in Nifi

2016-05-06 Thread Bryan Bende
Can you describe your setup a little bit more, things like...

Is NiFi running on the same machine that HBase and ZooKeeper are running on
(it doesn't have to be, just want to understand the setup) ?

What values do you have in hbase-site.xml for properties like...
- hbase.zookeeper.quorum
- hbase.zookeeper.property.clientPort

Here is a template that is working for me:
https://gist.githubusercontent.com/bbende/9457b5ed261e6eeb0f98995a5a2699e0/raw/38e050828c2545cb50e623d6dcf45dbbc9ad1d9d/FunWithHBaseUpdated.xml

I am running NiFi on the HDP Sandbox so everything is local for me.


On Fri, May 6, 2016 at 8:54 AM, Venkatesh Bodapati <
venkatesh.bodap...@inndata.in> wrote:

>   " Missing Row id failure, routing to failure " issue solved . Still i
> got the "failed to receive data from Hbase due to java net connection
> exception :Connection Refused" error in GetHbase Processor.  I update
> Hbase-Site.xml and remove remaining properties like " Zookeeper Quorum
> Info,Zookeeper Client Port Info . still i got the same error in GetHbase
> processor.
>
> On Thu, May 5, 2016 at 9:07 PM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> I was looking at this a little more and instead of needing the
>> RouteOnAttribute, I think you can just update EvaluateJsonPath properties...
>>
>> Based on the JSON I see coming in, they should be:
>>
>> email = $.email
>> firstName = $.name.first
>> lastName = $.name.last
>> ssn = $.id.value
>>
>>
>>
>> On Thu, May 5, 2016 at 11:30 AM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> For HBaseClient, are you sure your ZooKeeper being used by HBase is
>>> running on localhost:2181?
>>>
>>> Typically you don't really need to set the three ZooKeeper properties,
>>> and you can instead just set the hbase-site.xml in the config resources.
>>>
>>> For example, my Hadoop Configuration Resources is set as:
>>> /etc/hadoop/conf/core-site.xml,/etc/hbase/conf/hbase-site.xml
>>>
>>> And then I don't have anything specified for ZooKeeper QuorumInfo, etc.,
>>> because hbase-site.xml is providing all that information.
>>>
>>> The example template requires a Users table, so from Hbase shell:
>>> create 'Users', {NAME => 'cf'}
>>>
>>> Then the template has a slight problem where it is using EvaluteJsonPath
>>> to extract an attribute called "ssn" and using that for the row id... the
>>> problem is a lot of the ssn values are empty so they are failing.
>>>
>>> If you stick a RouteOnAttribute right after EvaluateJsonPath, and add a
>>> property "matched" = ${ssn:isEmpty():not()} that will filter out all the
>>> empty SSNs, then connect RouteOnAttribute's matched relationship to the
>>> HBase processors.
>>>
>>> Lets get the inserting side working and then I would expect the GetHBase
>>> to work since they use the same underlying connection.
>>>
>>> Thanks,
>>>
>>> Bryan
>>>
>>>
>>> On Thu, May 5, 2016 at 10:59 AM, Venkatesh Bodapati <
>>> venkatesh.bodap...@inndata.in> wrote:
>>>
>>>> I am working on "Fun_With_Hbase.xml" Template. In this i will send data
>>>> to hbase table, but i will not get data from hbase. I will get the "failed
>>>> to receive data from hbase due to java net connection exception :Connection
>>>> Refused" error in GetHbase Processor and In PutHbaseCell,PutHbaseJson
>>>> Processor i got error Like This " Missing Row id failure, routing to
>>>> failure ".
>>>>
>>>> This is my HbaseClient Properties :
>>>>
>>>> Hadoop Configuration FilesInfo  :
>>>> /usr/local/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml,/usr/local/hadoop/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
>>>> ZooKeeper QuorumInfo: localhost:2181
>>>> ZooKeeper Client PortInfo: 2181
>>>> ZooKeeper ZNode ParentInfo   : /hbase
>>>> HBase Client RetriesInfo   : 1
>>>>
>>>>
>>>> How do i get data from Hbase Table in Nifi,Any suggesions please.
>>>>
>>>
>>>
>>
>


Re: Multi data able joining

2016-05-04 Thread Bryan Bende
Hello,

I'm not sure if this will work depending on how large the tables are, but
since you were already able to move the data into three separate
tables...

Could you then use an ExecuteSQL processor with a select query that
joins the three tables together, so the results coming out of the DB are
already joined, and then insert those results into a new table?

Joining data streams is usually done in a stream processing framework like
Flink, Spark Streaming, or Storm.

Another option might be to prepare the data in NiFi and deliver it to one
of those stream processing systems,
do the join there and send the results back to NiFi to be inserted into the
new table.

Thanks,

Bryan


On Wed, May 4, 2016 at 3:22 AM, Ravisankar Mani  wrote:

> Hi All,
>
> I need some clarification about some workflow cases using apache NiFi
> tool. Please find the following scenarios.
>
>
>
>
>
> Get multi table from several datasource and to form a single table
>
>
>
> I have 3 tables from different data sources , one from MySQL, one from
> REST API and another from Excel. I have collected a tables from the
> different data sources and need to move a single table(all three into one)
> to SQL server
>
> 1.In this case, I did move to different tables to SQL server(mean separate
> table for each datasource table). But I need to merge into one(Join the
> multi tables) and move to a single table
>
> 2.Is there any default processors available to join multiple tables?
>
>
>
> Regards,
>
> Ravisankar
>
>


Re: Passing update processor to PutSolrContentStream

2016-05-04 Thread Bryan Bende
Hi Alex,

PutSolrContentStream would likely need the following configuration based on
what you described:

Solr Type = Cloud
Solr Location = Your ZK
Collection = feed
Content Stream Path = /update
Content Type = application/json

Add two user defined properties:

commit = true
update.chain = html-strip-junk

-Bryan


On Wed, May 4, 2016 at 4:38 AM, Alexander Alten-Lorenz 
wrote:

> Hi,
>
> I'm playing around with nifi and SolR, and try to figure out how I can add
> a processor to nifi so that get executed before the update runs. The SolR
> call would be: curl "
> http://localhost:8393/solr/feed_shard2_replica1/update?commit=true&update.chain=html-strip-junk"
> -H 'Content-type:application/json' -d .
>
> I tried a additional field f.1, adding to split - but that didn't worked.
> I also added to content-stream the Content Stream Path
> (update?commit=true=html-strip-source_s/update/json/docs), but
> then I get Error: Search requests cannot accept content streams; routing to
> failure - which is correct.
>
>
> Any ideas how I can get my "remove-html" processor to work?
>
> thanks,
>  --alex
>
> --
> blog: mapredit.blogspot.com


Re: Failed to receive data from Hbase due to java net connection exception :Connection Refused in Nifi

2016-05-05 Thread Bryan Bende
Hello,

For HBaseClient, are you sure the ZooKeeper instance used by HBase is running
on localhost:2181?

Typically you don't really need to set the three ZooKeeper properties, and
you can instead just set the hbase-site.xml in the config resources.

For example, my Hadoop Configuration Resources is set as:
/etc/hadoop/conf/core-site.xml,/etc/hbase/conf/hbase-site.xml

And then I don't have anything specified for ZooKeeper QuorumInfo, etc.,
because hbase-site.xml is providing all that information.

The example template requires a Users table, so from Hbase shell:
create 'Users', {NAME => 'cf'}

Then the template has a slight problem where it is using EvaluateJsonPath to
extract an attribute called "ssn" and using that for the row id... the
problem is that a lot of the ssn values are empty, so those flow files fail.

If you stick a RouteOnAttribute right after EvaluateJsonPath, and add a
property "matched" = ${ssn:isEmpty():not()} that will filter out all the
empty SSNs, then connect RouteOnAttribute's matched relationship to the
HBase processors.

Let's get the inserting side working and then I would expect the GetHBase to
work since they use the same underlying connection.

Thanks,

Bryan


On Thu, May 5, 2016 at 10:59 AM, Venkatesh Bodapati <
venkatesh.bodap...@inndata.in> wrote:

> I am working on "Fun_With_Hbase.xml" Template. In this i will send data to
> hbase table, but i will not get data from hbase. I will get the "failed to
> receive data from hbase due to java net connection exception :Connection
> Refused" error in GetHbase Processor and In PutHbaseCell,PutHbaseJson
> Processor i got error Like This " Missing Row id failure, routing to
> failure ".
>
> This is my HbaseClient Properties :
>
> Hadoop Configuration FilesInfo  :
> /usr/local/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml,/usr/local/hadoop/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
> ZooKeeper QuorumInfo: localhost:2181
> ZooKeeper Client PortInfo: 2181
> ZooKeeper ZNode ParentInfo   : /hbase
> HBase Client RetriesInfo   : 1
>
>
> How do i get data from Hbase Table in Nifi,Any suggesions please.
>


Re: Failed to receive data from Hbase due to java net connection exception :Connection Refused in Nifi

2016-05-05 Thread Bryan Bende
I was looking at this a little more and instead of needing the
RouteOnAttribute, I think you can just update EvaluateJsonPath properties...

Based on the JSON I see coming in, they should be:

email = $.email
firstName = $.name.first
lastName = $.name.last
ssn = $.id.value



On Thu, May 5, 2016 at 11:30 AM, Bryan Bende <bbe...@gmail.com> wrote:

> Hello,
>
> For HBaseClient, are you sure your ZooKeeper being used by HBase is
> running on localhost:2181?
>
> Typically you don't really need to set the three ZooKeeper properties, and
> you can instead just set the hbase-site.xml in the config resources.
>
> For example, my Hadoop Configuration Resources is set as:
> /etc/hadoop/conf/core-site.xml,/etc/hbase/conf/hbase-site.xml
>
> And then I don't have anything specified for ZooKeeper QuorumInfo, etc.,
> because hbase-site.xml is providing all that information.
>
> The example template requires a Users table, so from Hbase shell:
> create 'Users', {NAME => 'cf'}
>
> Then the template has a slight problem where it is using EvaluteJsonPath
> to extract an attribute called "ssn" and using that for the row id... the
> problem is a lot of the ssn values are empty so they are failing.
>
> If you stick a RouteOnAttribute right after EvaluateJsonPath, and add a
> property "matched" = ${ssn:isEmpty():not()} that will filter out all the
> empty SSNs, then connect RouteOnAttribute's matched relationship to the
> HBase processors.
>
> Lets get the inserting side working and then I would expect the GetHBase
> to work since they use the same underlying connection.
>
> Thanks,
>
> Bryan
>
>
> On Thu, May 5, 2016 at 10:59 AM, Venkatesh Bodapati <
> venkatesh.bodap...@inndata.in> wrote:
>
>> I am working on "Fun_With_Hbase.xml" Template. In this i will send data
>> to hbase table, but i will not get data from hbase. I will get the "failed
>> to receive data from hbase due to java net connection exception :Connection
>> Refused" error in GetHbase Processor and In PutHbaseCell,PutHbaseJson
>> Processor i got error Like This " Missing Row id failure, routing to
>> failure ".
>>
>> This is my HbaseClient Properties :
>>
>> Hadoop Configuration FilesInfo  :
>> /usr/local/hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml,/usr/local/hadoop/hadoop-2.6.0/etc/hadoop/hdfs-site.xml
>> ZooKeeper QuorumInfo: localhost:2181
>> ZooKeeper Client PortInfo: 2181
>> ZooKeeper ZNode ParentInfo   : /hbase
>> HBase Client RetriesInfo   : 1
>>
>>
>> How do i get data from Hbase Table in Nifi,Any suggesions please.
>>
>
>


Re: howto dynamically change the PutHDFS target directory

2016-04-18 Thread Bryan Bende
Mike,

If I am understanding correctly I think this can be done today... The
Directory property on PutHDFS supports expression language, so you could
set it to a value like:

/data/${now():format('dd-MM-yy')}/

This could be set directly in PutHDFS, although it is also a common pattern
to stick an UpdateAttribute processor in front of PutHDFS and set filename
and hadoop.dir attributes, and then in PutHDFS reference those as
${filename} and ${hadoop.dir}

The advantage to the UpdateAttribute approach is that you can have a single
PutHDFS processor that actually writes to many different locations.
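
A minimal sketch of that pattern (the attribute name hadoop.dir is just a
convention, use whatever you like):

    UpdateAttribute:
        hadoop.dir = /data/${now():format('dd-MM-yy')}

    PutHDFS:
        Directory = ${hadoop.dir}

I believe PutHDFS will create the target directory if it doesn't already
exist, which should also cover the directory creation part of your question.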

Hope that helps.

-Bryan


On Mon, Apr 18, 2016 at 2:53 PM, Oleg Zhurakousky <
ozhurakou...@hortonworks.com> wrote:

> Mike
>
> Indeed a very common requirement and we should support it.
> Would you mind raising a JIRA for it?
> https://issues.apache.org/jira/browse/NIFI
>
> Cheers
> Oleg
>
> On Apr 18, 2016, at 9:50 AM, Mike Harding  wrote:
>
> Hi All,
>
> I have a requirement to write a data stream into HDFS, where the flowfiles
> received per day are group into a directory. e.g. so I would end up with a
> folder structure as follows:
>
> data/18-04-16
> data/19-04-16
> data/20-04-16 ... etc
>
> Currently I can specify in the config for the putHDFS processor a target
> directory but I want this to change and point to a new directory as each
> day ends.
>
> So using nifi id like to 1) be able to create new directories in HDFS
> (although I could potentially write a bash script to do the directory
> creation) and 2) change the target directory as the day changes.
>
> Any help much appreciated,
>
> Mike
>
>
>


Re: StandardSSLContextService Not Being shown

2016-08-03 Thread Bryan Bende
Hello,

Adding the dev alias as well to see if anyone else knows the answer.

-Bryan

On Fri, Jul 29, 2016 at 10:37 AM, Mariama Barr-Dallas  wrote:

> Hello,
> I am attempting to add a Controller Service to a processor property via
> the rest API by changing the descriptors field and the properties field to
> the correct Id of the controller Service for the ProcessorEntity before
> uploading it;
> For some reason the first step is giving unexpected results: I cannot
> access any of the controller services via a GET rest api call until all of
> the other components (processors, process groups, connections) are loaded.
> Is there a certain order to when context services can be accessed and
> applied to a processor via the rest api?
>
>


Re: Data Provenance @scale in Nifi

2016-07-07 Thread Bryan Bende
Milind,

I'm not sure if I understand the question correctly, but are you asking how
to find a specific provenance event beyond the 1,000 most recent that are
displayed when loading the provenance view?

If so, there is a Search button in the top right of the Provenance window
that brings up a search window to search on specific fields or time ranges.

The fields available to search on can be customized in nifi.properties
through the following:

# Comma-separated list of fields. Fields that are not indexed will not be
searchable. Valid fields are:
# EventType, FlowFileUUID, Filename, TransitURI, ProcessorID,
AlternateIdentifierURI, Relationship, Details
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID,
Filename, ProcessorID

# FlowFile Attributes that should be indexed and made searchable
nifi.provenance.repository.indexed.attributes=twitter.msg, language

In the above example, the attributes twitter.msg and language are
attributes that are being extracted from tweets using EvalueJSONPath.

Does this help?

-Bryan

On Thu, Jul 7, 2016 at 1:27 AM, milind parikh 
wrote:

> I am relatively new to Nifi. I have written a processor in Java for Nifi (
> which gives you an understanding of my knowledge about nifi; which is
> little)
>
> I have a scenario where there are about 100k flow files a day representing
> about 100m records; which needs to be aggregated across 1m data points
> across 100 dimensions.
>
> If in my architecture, I split the initial flow file into records and
> write them into Kafka for 1000 records per flow file and read in parallel,
> how do I do data provenance of the aggregated values.
>
> The use case that I am interested in is showing how one of the data points
> ( out  of 1m) arrived at the daily aggregated value for an average of 100
> records coming out of very few of the 100k files.
>
> I can't expand the data provenance through the UI (1000 initial records )
> and THEN through 1m data points OR traverse through 1 m data points in the
> UI as my starting point.
>
> I know the exact reference of the data point ( it's truncated version of
> the sha1 of a complex but unique datapoint string).
>
> Is there a command line equivalent of the UI that can be more precisely
> targeted for one data point?
>
> Thanks
> Milind
>


Re: Json routing

2016-07-07 Thread Bryan Bende
Anuj,

Just to clarify, you want to route on the name of the element under
POSTransaction? Meaning, route "Order" to one place and "Refund" to another?

I'm not a JSON Path expert, but I can't come up with a way to get just an
element name from JSON Path; it is usually used to get the value of a known
path.

If you used $.POSTransaction. as the expression, I think you would get back
everything under POSTransaction, including the "Order" or "Refund" part, and
then in RouteOnAttribute you could use expression language to check what it
starts with, e.g. ${yourAttribute:startsWith('Order')}, or maybe use
contains() instead of startsWith.
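
A rough sketch of that flow (the attribute name posTxn is just illustrative,
and the exact JSON path may need tweaking):

    EvaluateJsonPath:
        Destination: flowfile-attribute
        Return Type: json
        (dynamic property) posTxn: $.POSTransaction

    RouteOnAttribute:
        order  = ${posTxn:contains('Order')}
        refund = ${posTxn:contains('Refund')}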

Another completely different option is to use the ExecuteScript processor
to write a Groovy/Jython/etc script that gets the name of the first element
under POSTransaction and adds it as an attribute.

-Bryan


On Thu, Jul 7, 2016 at 8:34 AM, Anuj Handa  wrote:

> Hi Folks,
>
> I have following two JSON documents and i would like to route them based
> on what the value is. In the below examples its order and refund. i want
> this to be dynamic as i can expect range of values.
>
> i was thinking of using EvaluateJsonPath and reading the Value of this
> field in the attribute. i was unable to get what Path expression should be
> .
>
> $.POSTransaction.* returns me the entire JSON and not just the Order value
>
> is it possible to make it dynamic ?  or is there a better/different  way
> to do this
>
> {
> "POSTransaction": {
> "Order": {
>
> {
> "POSTransaction": {
> "Refund": {
>
>
> Regards,
> Anuj
>


Re: Processors in cluster mode

2016-08-08 Thread Bryan Bende
Hi Manish,

This post [1] has an overview of how to distribute data across your NiFi
cluster.

In general though, NiFi runs the same flow on each node and the data needs
to be divided across the nodes appropriately depending on the situation.
The only exception to running the same flow on every node is when a
processor is scheduled to run Primary Node only.

Concurrent Tasks is the number of threads that will concurrently call a
given instance of a processor. So if you have processor "Foo" and a three
node cluster, and set concurrent tasks to 2, there will be three instances
of Foo and each will have two threads calling the onTrigger method.

For some of your specific cases...

ListenTCP - You would have an instance of this process on all three nodes
and need the producer to send to all of them, or have a load balancer that
supports TCP sitting in front of the nodes and have the producer send to
the load balancer.
Get/Fetch File - These pick up files from the local filesystem so it would
be up to the data producer to send/write files on each node of the cluster
for each instance of this processor to pick up.
Distribute Load Processor - There will be a Distribute Load processor
running on each node and operating on only the flow files on that node.
ExecuteSQL - Typically you would run this on the primary node only, or in an
upcoming release there are going to be some more options with a
ListDatabaseTables processor that can produce instructions that can be
distributed across a cluster to your ExecuteSQL processor.

Hope that helps.

Thanks,

Bryan

[1]
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

On Mon, Aug 8, 2016 at 7:55 AM, Manish Gupta 8  wrote:

> Hi,
>
>
>
> I am running a multi-node NiFi (0.7.0) cluster and trying to implement a
> streaming ingestion pipeline (@ 200 MB/s at peak and around 30 MB/s at
> non-peak hours) and routing to different destinations (Kafka, Azure
> Storage, HDFS). The dataflow will be exposing a TCP port for incoming data
> and will also be ingesting files from folder, database records etc.
>
>
>
> *It would be great if someone can provide a link/doc that explains how
> processors can be expected to behave in a multi-node environment.*
>
> *My doubts are about how some of the processors work in a clustered mode,
> and the meaning of concurrent tasks. *
>
>
>
> For example:
>
>
>
> · *ListenTCP*:
>
> o   When this processor is scheduled to run on a cluster (and not on the
> primary node), then does it mean I need to send data to all the individual
> nodes manually i.e. specify each node’s host names separately? If I don’t
> specify hosts individually and only provide let’s say primary node’s host
> name from producer, will all the other nodes remain idle? Or NiFi tries to
> distribute the data to other nodes using some routing strategy? I am trying
> to increase the throughput and thinking something like this as data
> producer strategy:
>
>
>
>
>
> And consumer will be simply as following:
>
>
>
>
>
> o   When I increase the number of concurrent tasks, does it make multiple
> copies of buffer/channel reader etc.? Or is it only the processing which
> gets multiplied?
>
> · *Get / Fetch File*:
>
> o   Can we assume that when this processor is running on multiple nodes
> and threads, the same file will never get pulled multiple times as a
> flow-file?
>
> · *Distribute Load Processor*:
>
> o   When this processor is running on multiple nodes, will all the
> incoming flow files go to each instance of running node? And this question
> is for any processor that run on a cluster and has to consume an incoming
> flow-file? What’s the general routing strategy in NiFi when a processor is
> running on multiple node?
>
> · *ExecuteSQL*
>
> o   Will all the running instances on all the nodes be hitting the RDBMS
> to fetch the data for the same query, leading to duplicates and heavy load
> on the database?
>
>
>
> Thanks,
>
> Manish
>
>
>


Re: How to insert a ISO date recode using PutMongo processor?

2016-06-28 Thread Bryan Bende
I'm not very familiar with MongoDB, but I can see that when using "update"
mode PutMongo does this:

else {
    // update
    final boolean upsert = context.getProperty(UPSERT).asBoolean();
    final String updateKey = context.getProperty(UPDATE_QUERY_KEY).getValue();
    final Document query = new Document(updateKey, doc.get(updateKey));

    collection.replaceOne(query, doc, new UpdateOptions().upsert(upsert));
    logger.info("updated {} into MongoDB", new Object[] { flowFile });
}

I'm wondering if that collection.replaceOne() is the problem; I see there
is a collection.updateOne() which sounds more appropriate here.
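
For reference, here is a rough sketch of the difference using the MongoDB Java
driver (3.x) directly. This is not the PutMongo source; the connection details,
database, collection, and document id below are made up for illustration:

import static com.mongodb.client.model.Filters.eq;

import org.bson.Document;
import org.bson.types.ObjectId;

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;

public class UpdateVsReplaceExample {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> collection =
                client.getDatabase("test").getCollection("documents");

        ObjectId id = new ObjectId("577216f0154b943fe8068079");
        Document incoming = Document.parse("{ \"expired\": true }");

        // replaceOne(): the stored document becomes exactly "incoming", so any
        // fields that are not in the flow file content are lost.
        collection.replaceOne(eq("_id", id), incoming);

        // updateOne() with $set: only the listed fields are modified and the
        // rest of the stored document is left intact.
        collection.updateOne(eq("_id", id), new Document("$set", incoming));

        client.close();
    }
}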

If someone with more MongoDB experience could verify this I would be happy
to open a JIRA and get this changed.

-Bryan

On Tue, Jun 28, 2016 at 5:32 AM, Asanka Sanjaya Herath <angal...@gmail.com>
wrote:

> Hi Bryan,
>
> Your suggestion worked fine. I have another question. It is not
> related to the subject, but it is related to the PutMongo processor. How can
> I use the PutMongo processor to add a new key-value pair to an existing
> document? The flow file contains the document object Id. I have set the
> 'mode' property to 'update', the 'upsert' property to false, and the 'update
> query key' property to '_id'. The flow file content is something like this.
>
> {
> _id:ObjectId(577216f0154b943fe8068079)
> expired:true
> }
>
> Instead of inserting 'expired:true', it replaces the whole document with
> the given one. So is there a way to insert the new key-value pair into the
> document without replacing the whole document in MongoDB using the PutMongo
> processor? Your input regarding this is highly appreciated.
>
>
>
> On Mon, Jun 27, 2016 at 6:43 PM, Asanka Sanjaya Herath <angal...@gmail.com
> > wrote:
>
>> Hi Bryan,
>> Thank you for the input. That really helps. I'll try that.
>>
>> On Mon, Jun 27, 2016 at 6:31 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Right now AttributesToJson does treat everything as strings. I remember
>>> a previous discussion about adding support for different types, but I can't
>>> find a JIRA that references this.
>>>
>>> One option to work around this could be to use ReplaceText to construct
>>> the JSON, instead of AttributesToJson. You could set the Replacement Value
>>> property to something like:
>>>
>>> {
>>>   "dataSourceId" : "${datasource}",
>>>   "filename" : "${filename}",
>>>   "sent_date" : ${sent_date},
>>>   "uuid" : "${uuid}",
>>>   "originalSource" : "${originalsource}"
>>> }
>>>
>>> Of course using the appropriate attribute names.
>>>
>>> Another option is that in the upcoming 0.7.0 release, there is a new
>>> processor to transform JSON using JOLT. With that processor you may be able
>>> to take the output of AttributesToJson and apply a transform that converts
>>> the date field to remove the quotes.
>>>
>>> Hope that helps.
>>>
>>> -Bryan
>>>
>>> On Mon, Jun 27, 2016 at 8:16 AM, Asanka Sanjaya Herath <
>>> angal...@gmail.com> wrote:
>>>
>>>> I'm trying to insert a flow file into MongoDB which has a date record as
>>>> an attribute. First I sent that flow file through an AttributesToJSON
>>>> processor, so that all attributes are converted to a JSON document in the
>>>> flow file body. When I insert that flow file into MongoDB using the PutMongo
>>>> processor, it saves the "sent_date" attribute as a String. I want it to
>>>> be saved as an ISO date object.
>>>>
>>>> My flow file looked like this.
>>>>
>>>> {
>>>>   "dataSourceId" : "",
>>>>   "filename" : "979f7bc5-a395-4396-9625-69fdb2c806c6",
>>>>   "sent_date" : "Mon Jan 18 04:50:50 IST 2016",
>>>>   "uuid" : "77a5ef56-8b23-40ee-93b5-78c6323e0e1c",
>>>>   "originalSource" : "ImportedZip"
>>>> }
>>>>
>>>> Then I prepend "ISODate" to "sent_date" attribute using another
>>>> processor. So now my flow file content looks like this.
>>>> {
>>>>   "dataSourceId" : "",
>>>>   "filename" : "979f7bc5-a395-4396-9625-69fdb2c806c6",
>>>>   "sent_date" : "ISODate('Mon Jan 18 04:50:50 IST 2016')",
>>>>   "uuid" : "77a5ef56-8b23-40ee-93b5-78c6323e0e1c",
>>>>   "originalSource" : "ImportedZip"
>>>> }
>>>>
>>>> But it is still saved as a string in MongoDB, because of the double
>>>> quotation marks. Is there a way to remove those double quotation marks when
>>>> converting with the AttributesToJSON processor?
>>>>
>>>> Any help is appreciated.
>>>>
>>>> --
>>>> Thanks,
>>>> Regards,
>>>> ASH
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Regards,
>> ASH
>>
>
>
>
> --
> Thanks,
> Regards,
> ASH
>


Re: Exception on Processor ConvertJSONToSQL

2016-08-15 Thread Bryan Bende
Carlos/Peter,

Thanks for reporting this issue. It seems IS_AUTOINCREMENT is causing
problems in a couple of situations; I know there was another issue with
Hive, where the driver returns IS_AUTO_INCREMENT rather than IS_AUTOINCREMENT.

We should definitely address this issue... would either of you be
interested in contributing your fix as a patch or PR? The try/catch
approach seems reasonable to me assuming there is no method on ResultSet to
check if the column exists.
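
For example, the fallback could look roughly like the sketch below. The helper
method is hypothetical and is not the actual change that went into the
codebase; it just illustrates the try/catch idea:

import java.sql.ResultSet;
import java.sql.SQLException;

public final class ColumnMetadataUtil {

    private ColumnMetadataUtil() {
    }

    // Some drivers (Netezza, Oracle) do not return an IS_AUTOINCREMENT column
    // from DatabaseMetaData.getColumns(), so treat a missing column as
    // "not auto-increment" instead of failing the flow file.
    public static String autoIncrementOrDefault(final ResultSet columnResult) {
        try {
            return columnResult.getString("IS_AUTOINCREMENT");
        } catch (final SQLException e) {
            return "NO";
        }
    }
}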

As far as being able to remove ConvertAvroToJSON if ExecuteSQL could
produce JSON... I think we initially went with Avro to preserve the most
information about the schema/types coming from the database, but I do
agree that many times the next step is to immediately convert it to some
other format. I think it would be ideal if there were a pluggable approach
when generating the output of these result sets, so that any processor
dealing with rows of data could offer the user an option to select the output
format and choose from options like Avro, JSON, and CSV. There likely needs to
be some more design discussion around it, and of course development time :)

Thanks,

Bryan

On Mon, Aug 15, 2016 at 11:16 AM, Peter Wicks (pwicks) 
wrote:

> Carlos,
>
>
>
> I ran into this same error when querying Teradata. It looks like a lot of
> databases don’t include this.
>
> I submitted a bug a couple weeks ago:
> https://issues.apache.org/jira/browse/NIFI-2356
>
>
>
> I did something similar to your suggestion locally in a modified version
> of the code.
>
>
>
> Regards,
>
>   Peter
>
>
>
>
>
>
>
> *From:* Carlos Manuel Fernandes (DSI) [mailto:carlos.antonio.
> fernan...@cgd.pt]
> *Sent:* Thursday, August 11, 2016 9:20 AM
> *To:* users@nifi.apache.org
> *Subject:* Exception on Processor ConvertJSONToSQL
>
>
>
> Hi All,
>
>
>
> I am running some tests to move data from DB2 to Netezza using NiFi. If I
> don’t use custom processors, it’s an indirect way:
>
>
>
> ExecuteSQL(on db2) -> ConvertAvroToJSON -> ConvertJSONToSQL -> PutSQL
> (bulk on netezza)
>
>
>
> and this way, I get an Exception on ConvertJSONToSQL:
>
> org.netezza.error.NzSQLException: The column name IS_AUTOINCREMENT not
> found.
>
> at org.netezza.sql.NzResultSet.findColumn(NzResultSet.java:266)
> ~[nzjdbc.jar:na]
>
> at org.netezza.sql.NzResultSet.getString(NzResultSet.java:1407)
> ~[nzjdbc.jar:na]
>
> at org.apache.commons.dbcp.DelegatingResultSet.getString(DelegatingResultSet.java:263) ~[na:na]
>
> at org.apache.nifi.processors.standard.ConvertJSONToSQL$ColumnDescription.from(ConvertJSONToSQL.java:678) ~[nifi-standard-processors-0.7.0.jar:0.7.0]
>
>
>
> The Netezza JDBC driver doesn’t implement the *IS_AUTOINCREMENT* metadata
> column (the same is true for the Oracle driver). Probably the reason is that
> Netezza and Oracle don’t have auto-increment columns, because they use
> sequences for this purpose.
>
>
>
> One possible solution is to put a try/catch (it isn’t beautiful) around
>
> final String autoIncrementValue = resultSet.getString("IS_AUTOINCREMENT");
> (ConvertJSONToSQL.java:678)
>
> and in the catch, set autoIncrementValue=’NO’
>
>
>
>
>
> Besides this error, we could remove the ConvertAvroToJSON step in the flow
> if ExecuteSQL were changed to optionally generate its output as Avro or JSON.
>
>
>
> What do you think?
>
>
>
> Thanks
>
>
>
> Carlos
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: nifi at AWS

2017-01-22 Thread Bryan Bende
Hello,

I'm assuming you are using site-to-site since you mentioned failing to
create a transaction.

In nifi.properties on the AWS instance, there is probably a value for
nifi.remote.input.socket.port which would also need to be opened.
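
For example, if that instance has a line roughly like the following (the port
value here is only an example, not a default):

nifi.remote.input.socket.port=10443

then the AWS security group would also need to allow inbound TCP traffic on
that port, in addition to 8080.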

-Bryan

On Sat, Jan 21, 2017 at 7:00 PM, mohammed shambakey 
wrote:

> Hi
>
> I'm trying to send a file from a local NiFi instance to a remote NiFi
> instance in AWS. The security rules at the remote instance have port 8080
> open, yet each time I try to send the file, the local NiFi says it failed to
> create a transaction to the remote instance.
>
> Regards
>
> --
> Mohammed
>

