Arnaud,

Can you explain more about what you'd like to do via an INSERT query?
Are you trying to accomplish #3 using Hive via JDBC?  If so, you should
be able to use PutHiveQL rather than PutSQL. If you already have an
external table in Hive and don't yet have the ORC table, you should be
able to use a CREATE TABLE AS SELECT (CTAS) statement [1] in PutHiveQL.
If the ORC table exists and you want to insert from the external table,
you can use INSERT INTO/OVERWRITE [2].  Apologies if I've misunderstood
what you are trying to do; if that's the case, can you please
elaborate?
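
For example, something like the following should work in PutHiveQL (a
minimal sketch; ext_logs and orc_logs are made-up names standing in for
your external and ORC tables):

  -- CTAS: create and populate the ORC table in one statement
  -- (ext_logs / orc_logs are hypothetical names)
  CREATE TABLE orc_logs STORED AS ORC AS SELECT * FROM ext_logs;

  -- Or, if orc_logs already exists, append from the external table
  INSERT INTO TABLE orc_logs SELECT * FROM ext_logs;

  -- Or replace its contents entirely
  INSERT OVERWRITE TABLE orc_logs SELECT * FROM ext_logs;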

Per your comment that you can't trigger GetHDFS, consider using
ListHDFS [3] and/or FetchHDFS [4] instead. If you know which files you
want (from the flow), you don't need ListHDFS; instead you'd just set
the filename attribute on the flow file and route it to FetchHDFS.

Having said that, if you are already pulling the content of the HDFS
files into NiFi, perhaps consider the ConvertAvroToORC [5] processor
(if you can easily get your incoming data into Avro). This would allow
you to convert to ORC within NiFi; then you can use PutHDFS to land the
files on Hadoop, and PutHiveQL to create a table on top of the
directory containing the ORC files.  If that is overkill, hopefully
PutHiveQL with the CTAS or INSERT statements will suffice.
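
For that last step, the DDL in PutHiveQL would look roughly like this
(a sketch; the table name, columns, and path are placeholders for your
own):

  -- Hypothetical schema/path: point a table at the directory
  -- where PutHDFS landed the ORC files
  CREATE EXTERNAL TABLE orc_logs (id INT, msg STRING)
  STORED AS ORC
  LOCATION '/data/orc_logs';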

Regards,
Matt

[1] 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
[2] 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
[3] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.ListHDFS/index.html
[4] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.FetchHDFS/index.html
[5] 
https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hive.ConvertAvroToORC/index.html


On Wed, Apr 5, 2017 at 8:45 AM, Arnaud G <[email protected]> wrote:
> Hi,
>
> I'm currently building a flow in NiFi and I'm trying to find the best way
> to do it in a reliable manner:
>
> The setup is the following:
>
> 1) Some files are copied into a folder in HDFS
> 2) A Hive external table points to this directory
> 3) The data from this table are then copied into an ORC table
> 4) The data from the folder are archived and compressed in another folder
>
> My first issue is that I cannot easily trigger an INSERT SQL query from
> NiFi. The ExecuteSQL processor only executes SELECT queries, not INSERT
> queries. I can of course SELECT all the data, bring it back into NiFi, and
> then use PutSQL, but as the data are going to be copied as-is, this doesn't
> add any value.
> My current solution is to rely on an external Python script (using JDBC
> from there) and use ExecuteStreamCommand to trigger the insert from the
> external table. It is not very elegant, but it seems to work.
>
> Now I have to ensure that the SQL query is successful before moving the
> file to another folder; otherwise I will end up with inconsistent data. I'm
> currently using GetHDFS/PutHDFS to move files around; however, it is not
> possible to trigger the GetHDFS processor.
>
> What would be the best strategy to move the HDFS files only if a previous
> event is successful? Any recommendations?
>
> Thanks for your help!
>
> Regards,
>
