Re: SelectHiveQL HiveConnectionPool issues

2016-05-09 Thread Mike Harding
Query is "select * from  limit 1"

The table schema has a map column type, which is the cause; the rest of the
columns are strings.

Cheers,
Mike
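
For illustration only, here is a hedged sketch of what Matt's coercion suggestion
(quoted below) could look like against a table with a map column, issued through
the plain Hive JDBC driver. The table name, column names, host, port, and
credentials are all made up for the example; none of them come from this thread.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class CoercedSelectCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 URL; 10000 is the usual default port
        String url = "jdbc:hive2://hive-host.example.com:10000/default";
        // Pull individual keys out of the map column and CAST anything exotic,
        // so only conventional SQL types (STRING, INT, ...) reach the client
        String sql = "SELECT id, "
                   + "props['status'] AS status, "
                   + "CAST(created AS STRING) AS created "
                   + "FROM my_table LIMIT 1";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            for (int i = 1; i <= md.getColumnCount(); i++) {
                // A column still reported as JAVA_OBJECT would hit the same Avro error
                System.out.println(md.getColumnName(i) + " -> " + md.getColumnTypeName(i));
            }
        }
    }
}

If every column comes back as a conventional JDBC type, the same SELECT should
also export cleanly through SelectHiveQL's Avro output.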



On Mon, 9 May 2016 at 17:56, Matt Burgess  wrote:

> Mike,
>
> It shouldn't matter what the underlying format is, as the Hive driver
> should take care of the type coercion. Your error refers to a column that
> is of type JAVA_OBJECT, which in Hive usually happens when you have an
> "interval" type (Added in Hive 1.2.0 [1] but apparently not yet
> documented). Does your select query do things like date arithmetic? If so,
> the SelectHiveQL processor does not currently support interval types, but I
> can take a look. If not, then perhaps one or more of your columns needs
> explicit type coercion in the SELECT query, such that it is recognized as a
> more "conventional" SQL type.
>
> Regards,
> Matt
>
> [1] https://issues.apache.org/jira/browse/HIVE-9792
>
>
> On Mon, May 9, 2016 at 12:34 PM, Mike Harding 
> wrote:
>
>> aaah of course! Thanks Matt that fixed it.
>> When I run my select query I can now receive the results in CSV but when
>> I select to export it in Avro I get the following exception:
>>
>> [image: Inline images 1]
>>
>> I'm assuming this is happening because the underlying data on HDFS that my
>> Hive table is reading from is not Avro? It's currently standard JSON.
>>
>> Thanks,
>> Mike
>>
>>
>>
>>
>>
>>
>> On 9 May 2016 at 17:09, Matt Burgess  wrote:
>>
>>> Your URL has a scheme of "mysql", try replacing with "hive2", and also
>>> maybe explicitly setting the port:
>>>
>>> jdbc:hive2://:1/default
>>>
>>> If that doesn't work, can you see if there is an error/stack trace in
>>> logs/nifi-app.log?
>>>
>>> Regards,
>>> Matt
>>>
>>> On Mon, May 9, 2016 at 12:04 PM, Mike Harding 
>>> wrote:
>>> > Hi All,
>>> >
>>> > I'm trying to test out the new SelectHiveQL processor but I'm
>>> struggling to
>>> > get the HiveConnectionPool configured correctly as I keep getting
>>> 'error
>>> > getting hive connection'.
>>> >
>>> > I'm setting the database URL to my db 'default' as
>>> > jdbc:mysql:///default
>>> >
>>> > Nifi is installed on a different node in my cluster so I have set the
>>> > hive-site.xml to point to /etc/spark/2.4.0.0-169/0/hive-site.xml
>>> >
>>> > I currently have Hive Authorization = None and HiveServer2
>>> authentication =
>>> > none but I still specify a user name used to create the db without a
>>> > password.
>>> >
>>> > Would appreciate it if someone could share how they have things
>>> configured.
>>> >
>>> > Thanks,
>>> > Mike
>>>
>>
>>
>


Re: SelectHiveQL HiveConnectionPool issues

2016-05-09 Thread Mike Harding
aaah of course! Thanks Matt that fixed it.
When I run my select query I can now receive the results in CSV but when I
select to export it in Avro I get the following exception:

[image: Inline images 1]

I'm assuming this is happening because the underlying data on HDFS that my Hive
table is reading from is not Avro? It's currently standard JSON.

Thanks,
Mike






On 9 May 2016 at 17:09, Matt Burgess  wrote:

> Your URL has a scheme of "mysql", try replacing with "hive2", and also
> maybe explicitly setting the port:
>
> jdbc:hive2://:1/default
>
> If that doesn't work, can you see if there is an error/stack trace in
> logs/nifi-app.log?
>
> Regards,
> Matt
>
> On Mon, May 9, 2016 at 12:04 PM, Mike Harding 
> wrote:
> > Hi All,
> >
> > I'm trying to test out the new SelectHiveQL processor but I'm struggling
> to
> > get the HiveConnectionPool configured correctly as I keep getting 'error
> > getting hive connection'.
> >
> > I'm setting the database URL to my db 'default' as
> > jdbc:mysql:///default
> >
> > Nifi is installed on a different node in my cluster so I have set the
> > hive-site.xml to point to /etc/spark/2.4.0.0-169/0/hive-site.xml
> >
> > I currently have Hive Authorization = None and HiveServer2
> authentication =
> > none but I still specify a user name used to create the db without a
> > password.
> >
> > Would appreciate it if someone could share how they have things
> configured.
> >
> > Thanks,
> > Mike
>


Re: SelectHiveQL HiveConnectionPool issues

2016-05-09 Thread Matt Burgess
Your URL has a scheme of "mysql", try replacing with "hive2", and also
maybe explicitly setting the port:

jdbc:hive2://:1/default

If that doesn't work, can you see if there is an error/stack trace in
logs/nifi-app.log?

Regards,
Matt
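
For reference, a minimal smoke test of the hive2 URL outside NiFi; it assumes the
Hive JDBC driver and its dependencies are on the classpath, and the host, port,
and credentials below are placeholders (10000 is the usual HiveServer2 default):

import java.sql.Connection;
import java.sql.DriverManager;

public class HiveUrlSmokeTest {
    public static void main(String[] args) throws Exception {
        // Standard HiveServer2 JDBC driver class
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hive-host.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "")) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}

If this fails with the same error, the problem is likely with the URL or the
HiveServer2 setup rather than with the HiveConnectionPool controller service itself.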

On Mon, May 9, 2016 at 12:04 PM, Mike Harding  wrote:
> Hi All,
>
> I'm trying to test out the new SelectHiveQL processor but I'm struggling to
> get the HiveConnectionPool configured correctly as I keep getting 'error
> getting hive connection'.
>
> I'm setting the database URL to my db 'default' as
> jdbc:mysql:///default
>
> Nifi is installed on a different node in my cluster so I have set the
> hive-site.xml to point to /etc/spark/2.4.0.0-169/0/hive-site.xml
>
> I currently have Hive Authorization = None and HiveServer2 authentication =
> none but I still specify a user name used to create the db without a
> password.
>
> Would appreciate it if someone could share how they have things configured.
>
> Thanks,
> Mike


Re: Logstash/ Filebeat/ Lumberjack -> Nifi

2016-05-09 Thread Andrew Grande
Conrad,

Set up a site-to-site connection between NiFi edge nodes and your main 
processing cluster running a bigger NiFi instance. This is the 'application'-level 
protocol native to NiFi. MiNiFi, in turn, uses it under the hood as well, 
which will ease migration for you in the _near_ future ;)

Andrew
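
As an aside, the same site-to-site protocol can also be driven from code via the
nifi-site-to-site-client library. The sketch below is only meant to show the shape
of the protocol; the URL and the input port name are assumptions, and in practice
an edge NiFi (or, later, MiNiFi) instance would be doing this for you:

import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.nifi.remote.Transaction;
import org.apache.nifi.remote.TransferDirection;
import org.apache.nifi.remote.client.SiteToSiteClient;

public class SiteToSiteSendSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical central NiFi instance with a remote input port named "from-edge"
        try (SiteToSiteClient client = new SiteToSiteClient.Builder()
                .url("http://central-nifi.example.com:8080/nifi")
                .portName("from-edge")
                .build()) {
            Transaction transaction = client.createTransaction(TransferDirection.SEND);
            transaction.send("one log line from the edge\n".getBytes(StandardCharsets.UTF_8),
                    Collections.<String, String>emptyMap());
            transaction.confirm();   // checksum handshake with the remote side
            transaction.complete();
        }
    }
}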



On Sun, May 8, 2016 at 11:52 PM -0700, "Conrad Crampton" wrote:

Thanks for this – you make some very interesting points about the use of 
Logstash, and you are correct: I am only just looking at Logstash, but I will now 
look to use NiFi instead, if possible, to connect to my central cluster.
Regards
Conrad

From: Andrew Psaltis
Reply-To: "users@nifi.apache.org"
Date: Saturday, 7 May 2016 at 16:43
To: "users@nifi.apache.org"
Subject: Re: Logstash/ Filebeat/ Lumberjack -> Nifi

Hi Conrad,
Based on your email it sounds like you are potentially just getting started 
with Logstash. The one thing I can share is that up until recently I worked in 
an environment where we had ~3,000 nodes deployed and all either had Logstash 
or Flume (was transitioning to Logstash). We used Puppet and the Logstash 
module was in the base templates, so as app developers provisioned new nodes 
Logstash was automatically deployed and configured. I can tell you that it 
seems really easy at first; however, my team was always messing with, tweaking, 
and troubleshooting the Logstash scripts as we wanted to ingest different data 
sources, modify how the data was captured, or fix bugs. Knowing now what I do 
about NiFi, if I had a chance to do it over again (I will be talking to old 
colleagues about it) I would just use NiFi on all of those edge nodes and then 
send the data to a central NiFi cluster. To me there are at least several huge 
benefits to this:

  1.  You use one tool, which provides an amazingly easy and very powerful way 
to control and adjust the dataflow all without having to muck with any scripts. 
You can easily filter / enrich / transform the data at the edge node all via a 
UI.
  2.  You get provenance information from the edge all the way back. This is 
very powerful: you can actually answer questions from others like "how come 
my log entry never made it to System X", or even better, show how the data was 
changed along the way. The "why didn't my log entry make it to System X" question 
can sometimes be answered by searching through logs, but that also assumes you have 
the information in the logs to begin with. I can tell you that these questions will 
come up. We had data that would go through a pipeline and finally into HDFS, 
and we would get questions from app developers when they queried the data 
in Hive and wanted to know why certain log entries were missing.

Hope this helps.

In good health,
Andrew

On Sat, May 7, 2016 at 8:15 AM, Conrad Crampton wrote:
Hi Bryan,
Some good tips and validation of my thinking.
It did occur to me to use standalone NiFi, and I have no particular need 
to use Logstash for any other reason.
Thanks
Conrad

From: Bryan Bende
Reply-To: "users@nifi.apache.org"
Date: Friday, 6 May 2016 at 14:56
To: "users@nifi.apache.org"
Subject: Re: Logstash/ Filebeat/ Lumberjack -> Nifi

Hi Conrad,

I am not that familiar with Logstash, but as you mentioned there is a PR for 
Lumberjack processors [1] which is not yet released but could help if you are 
already using Logstash.
If Logstash has outputs for TCP, UDP, or syslog then, as you mentioned, it 
seems like this could work well with ListenTCP, ListenUDP, or ListenSyslog.

I think the only additional benefit of Lumberjack is that it is an 
application-level protocol that provides additional reliability on top of the 
networking protocols: if ListenLumberjack receives an event over TCP, it then 
acknowledges that NiFi has successfully received and stored the data, since TCP 
can only guarantee that the data was delivered to the socket; the application 
could still have dropped it.

Although MiNiFi is not yet released, a possible solution is to run standalone 
NiFi instances on the servers where your logs are, with a simple flow like 
TailFile -> Remote Process Group which sends the logs back to a central NiFi 
instance over Site-To-Site.

Are you able to share any more info about what kind of logs they are and how 
they are being produced?
If they are coming from Java applications using logback or log4j, and if you 
have control over those applications, you can also use a specific appender like 
a UDP appender to send directly over to ListenUDP in NiFi.

Re: How to convert data in csv file into json data in nifi

2016-05-09 Thread Mark Payne
Venkatesh,

Right now, there is no direct way to go from CSV to JSON. You can, however, 
convert CSV to Avro with
the ConvertCSVToAvro processor and then go from Avro to JSON via the 
ConvertAvroToJSON. The
ConvertCSVToAvro Processor will require an Avro schema, but this can be 
automatically detected using
the InferAvroSchema Processor.
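
Conceptually, what that chain does for a single record can be sketched with the
plain Avro library; this is only an illustration of the CSV -> Avro -> JSON idea,
not the processors' actual code, and the schema and sample row are invented:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class CsvToJsonViaAvro {
    public static void main(String[] args) throws Exception {
        // The kind of schema an InferAvroSchema-style step might produce for a 2-column CSV
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"row\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        String[] csv = "alice,31".split(",");        // one CSV record
        GenericRecord record = new GenericData.Record(schema);
        record.put("name", csv[0]);
        record.put("age", Integer.parseInt(csv[1]));

        // Write the Avro record out as JSON
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println(out.toString("UTF-8"));   // {"name":"alice","age":31}
    }
}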

Obviously, this isn't as straightforward and easy as we would like. There is a JIRA [1]
to build a Processor that converts directly from CSV to JSON. If you are 
interested in building such a
Processor, we would be more than happy to assist however we can.

Thanks!
-Mark


[1] https://issues.apache.org/jira/browse/NIFI-1398


> On May 7, 2016, at 11:43 AM, Venkatesh Bodapati 
>  wrote:
> 
> Is there any way to convert data in a CSV file to JSON data in NiFi? Is it 
> possible, and which processors are used to convert the data into JSON?



Re: Logstash/ Filebeat/ Lumberjack -> Nifi

2016-05-09 Thread Conrad Crampton
Thanks for this – you make some very interesting points about the use of 
Logstash, and you are correct: I am only just looking at Logstash, but I will now 
look to use NiFi instead, if possible, to connect to my central cluster.
Regards
Conrad

From: Andrew Psaltis
Reply-To: "users@nifi.apache.org"
Date: Saturday, 7 May 2016 at 16:43
To: "users@nifi.apache.org"
Subject: Re: Logstash/ Filebeat/ Lumberjack -> Nifi

Hi Conrad,
Based on your email it sounds like you are potentially just getting started 
with Logstash. The one thing I can share is that up until recently I worked in 
an environment where we had ~3,000 nodes deployed and all either had Logstash 
or Flume (was transitioning to Logstash). We used Puppet and the Logstash 
module was in the base templates, so as app developers provisioned new nodes 
Logstash was automatically deployed and configured. I can tell you that it 
seems really easy at first; however, my team was always messing with, tweaking, 
and troubleshooting the Logstash scripts as we wanted to ingest different data 
sources, modify how the data was captured, or fix bugs. Knowing now what I do 
about NiFi, if I had a chance to do it over again (I will be talking to old 
colleagues about it) I would just use NiFi on all of those edge nodes and then 
send the data to a central NiFi cluster. To me there are at least several huge 
benefits to this:

  1.  You use one tool, which provides an amazingly easy and very powerful way 
to control and adjust the dataflow all without having to muck with any scripts. 
You can easily filter / enrich / transform the data at the edge node all via a 
UI.
  2.  You get provenance information from the edge all the way back. This is 
very powerful: you can actually answer questions from others like "how come 
my log entry never made it to System X", or even better, show how the data was 
changed along the way. The "why didn't my log entry make it to System X" question 
can sometimes be answered by searching through logs, but that also assumes you have 
the information in the logs to begin with. I can tell you that these questions will 
come up. We had data that would go through a pipeline and finally into HDFS, 
and we would get questions from app developers when they queried the data 
in Hive and wanted to know why certain log entries were missing.

Hope this helps.

In good health,
Andrew

On Sat, May 7, 2016 at 8:15 AM, Conrad Crampton wrote:
Hi Bryan,
Some good tips and validation of my thinking.
It did occur to me to use standalone NiFi, and I have no particular need 
to use Logstash for any other reason.
Thanks
Conrad

From: Bryan Bende
Reply-To: "users@nifi.apache.org"
Date: Friday, 6 May 2016 at 14:56
To: "users@nifi.apache.org"
Subject: Re: Logstash/ Filebeat/ Lumberjack -> Nifi

Hi Conrad,

I am not that familiar with Logstash, but as you mentioned there is a PR for 
Lumberjack processors [1] which is not yet released but could help if you are 
already using Logstash.
If Logstash has outputs for TCP, UDP, or syslog then, as you mentioned, it 
seems like this could work well with ListenTCP, ListenUDP, or ListenSyslog.

I think the only additional benefit of Lumberjack is that it is an 
application-level protocol that provides additional reliability on top of the 
networking protocols: if ListenLumberjack receives an event over TCP, it then 
acknowledges that NiFi has successfully received and stored the data, since TCP 
can only guarantee that the data was delivered to the socket; the application 
could still have dropped it.

Although MiNiFi is not yet released, a possible solution is to run standalone 
NiFi instances on the servers where your logs are, with a simple flow like 
TailFile -> Remote Process Group which sends the logs back to a central NiFi 
instance over Site-To-Site.

Are you able to share any more info about what kind of logs they are and how 
they are being produced?
If they are coming from Java applications using logback or log4j, and if you 
have control over those applications, you can also use a specific appender like 
a UDP appender to send directly over to ListenUDP in NiFi.

Hope that helps.

-Bryan

[1] https://github.com/apache/nifi/pull/290
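
Bryan's UDP appender suggestion above can be pictured, from the application side,
as nothing more than a datagram aimed at ListenUDP. A tiny hedged sketch (the NiFi
host and port are assumptions; a real application would use its logging framework's
own UDP or syslog appender rather than hand-rolled code):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class UdpLogLineSender {
    public static void main(String[] args) throws Exception {
        byte[] line = "2016-05-09 12:00:00,000 INFO my.app - something happened\n"
                .getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            // ListenUDP is assumed to be listening on port 5140 of the NiFi host
            socket.send(new DatagramPacket(line, line.length,
                    InetAddress.getByName("nifi-host.example.com"), 5140));
        }
    }
}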

On Fri, May 6, 2016 at 3:33 AM, Conrad Crampton wrote:
Hi,
Some advice if possible, please. Whilst I would love to wait for the MiNiFi 
project to realise its objectives, as this sounds exactly what I want from the 
initial suggestions