Hi Matt,

Thank you very much for your time and detailed answer.
I will try to explain the use case a bit more, as I suppose it should be close to standard patterns. My current use case involves extracting data from a Web API. The data is not as clean as we would like, but we have to cope with that. The data is stored as CSV files on HDFS, and an external Hive table points to the directory. As the data contains many duplicates/errors that need to be cleaned up before ingestion, we have a PrestoDB view on this table that provides a clean data set.

The goal now is to select the data from this view and insert it into another Hive/Presto table. For this I still need to join the data from the landing table with the data in the destination table, to ensure that I'm not inserting duplicate records. I do this with Presto. Once done, I need to move the files from the landing folder to another folder in order to empty/truncate the external table. This is where my problem resides: I'm struggling to find an elegant way to trigger this file move only if the SQL process is successful.

I had a look at ListHDFS, but this processor, like GetHDFS, cannot be triggered (except by CRON, of course). As I need to receive enough data in this table to correlate the records, I cannot have a flow process based on each individual file, which rules out FetchHDFS. Direct conversion to ORC doesn't work either, as I have to filter/deduplicate a lot of information from the raw API data.

The last idea we had was to insert line by line into a landing table that would act as the current folder, but with more flexibility, as we wouldn't have to take care of the CSV files. Unfortunately this doesn't work with PutHiveQL, as we have around 10'000-100'000 lines to insert per minute and the PutHiveQL processor cannot keep up (we get 1-2 inserts per second).
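For clarity, the deduplicating insert I run in Presto is essentially an anti-join between the clean landing view and the destination table. Here is a minimal sketch of how such a statement can be generated (the table and column names below are placeholders, not our real schema):

```python
def build_dedup_insert(landing_view: str, dest_table: str, key_cols: list) -> str:
    """Build an INSERT ... SELECT that skips rows whose key already exists
    in the destination table (anti-join via LEFT JOIN ... IS NULL)."""
    on_clause = " AND ".join(f"src.{c} = dst.{c}" for c in key_cols)
    null_check = f"dst.{key_cols[0]} IS NULL"
    return (
        f"INSERT INTO {dest_table} "
        f"SELECT src.* FROM {landing_view} src "
        f"LEFT JOIN {dest_table} dst ON {on_clause} "
        f"WHERE {null_check}"
    )

# Hypothetical names for illustration only:
sql = build_dedup_insert("clean_landing_view", "orc_events", ["event_id", "event_ts"])
print(sql)
```

The generated statement can then be submitted to Presto however is convenient (in my case via the external script and ExecuteStreamCommand).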
We also tried the Hive Streaming processor, but despite our efforts we were unable to make it work with Kerberos and HDP 2.5 (the NiFi processor seems to require the "hive-metastore" principal that we don't have, and when we created it, we still ran into Kerberos issues). Our last test was to use the Presto Teradata JDBC driver with the PutSQL processor, but that doesn't work either, as this driver runs in auto-commit mode, which is incompatible with the processor.

I'm currently trying to imagine a better flow that works around these limitations, and will maybe try to use a SQL database as a buffer instead. This should give me a better way to control when to truncate the table.

Of course any thoughts/recommendations are appreciated.

Thanks!

On Wed, Apr 5, 2017 at 10:16 PM, Matt Burgess <[email protected]> wrote:
> Arnaud,
>
> Can you explain more about what you'd like to do via an INSERT query?
> Are you trying to accomplish #3 using Hive via JDBC? If so you should
> be able to use PutHiveQL rather than PutSQL. If you already have an
> external table in Hive and don't yet have the ORC table, you should be
> able to use a CREATE TABLE AS (CTAS) statement [1] in PutHiveQL. If
> the ORC table exists and you want to insert from the external table,
> you can use INSERT INTO/OVERWRITE [2]. Apologies if I misunderstood
> what you are trying to do, if that's the case can you please
> elaborate?
>
> Per your comment that you can't trigger GetHDFS, consider using
> ListHDFS [3] and/or FetchHDFS [4] instead. If you know which files you
> want (from the flow), you don't need ListHDFS, rather you'd just set
> the filename attribute on the flow and route it to FetchHDFS. Having
> said that, if you are already pulling the content of the HDFS files
> into NiFi, perhaps consider the ConvertAvroToORC [5] processor (if you
> can easily get your incoming data into Avro).
> This would allow you to convert to ORC within NiFi, then you can use
> PutHDFS to land the files on Hadoop, then PutHiveQL to create a table
> on top of the directory containing the ORC files. If that is overkill,
> hopefully the PutHiveQL with the CTAS or INSERT statements will suffice.
>
> Regards,
> Matt
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
> [2] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
> [3] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.ListHDFS/index.html
> [4] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.FetchHDFS/index.html
> [5] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hive.ConvertAvroToORC/index.html
>
>
> On Wed, Apr 5, 2017 at 8:45 AM, Arnaud G <[email protected]> wrote:
> > Hi,
> >
> > I'm currently building a flow in NiFi and I'm trying to find the best way
> > to do it in a reliable manner.
> >
> > The setup is the following:
> >
> > 1) Some files are copied into a folder in HDFS
> > 2) A Hive external table points to this directory
> > 3) The data of this table are then copied into an ORC table
> > 4) The data from the folder are archived and compressed in another folder
> >
> > My first issue is that I cannot easily trigger an INSERT SQL query from
> > NiFi. The ExecuteSQL processor only executes SELECT queries, not INSERT
> > queries. I can of course select all the data, bring them back into NiFi
> > and then use PutSQL, but as the data are going to be copied as is, it
> > doesn't bring any value.
> > My current solution is to rely on an external Python script (using JDBC
> > from there) and use ExecuteStreamCommand to trigger the insert from the
> > external table. It is not very elegant but it seems to work.
> >
> > Now I have to ensure that the SQL query is successful before moving the
> > files to another folder, otherwise I will end up with inconsistent data.
> > I'm currently using GetHDFS/PutHDFS to move files around, however it is
> > not possible to trigger the GetHDFS processor.
> >
> > What would be the best strategy to move the HDFS files only if a previous
> > event is successful? Any recommendations?
> >
> > Thanks for your help!
> >
> > Regards,
