Hi Matt,

Thank you very much for your time and detailed answer.

I will try to explain the use case in a bit more detail, as I suppose it is
close to standard patterns.

My current use case involves extracting data from a Web API. This data is
not as clean as we would like, but we have to cope with that. The data are
then stored as CSV files on HDFS, and an external Hive table points to the
directory. As the data contain many duplicates/errors that need to be
cleaned up before ingestion, we have a PrestoDB view on this table that
provides a clean data set.
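
For illustration, a minimal sketch of such a cleaning view (the table and
column names are made up, not our actual ones):

    -- Deduplicate the raw CSV rows and drop obviously broken records.
    CREATE OR REPLACE VIEW clean_api_data AS
    SELECT id, payload, event_time
    FROM (
        SELECT id, payload, event_time,
               row_number() OVER (PARTITION BY id ORDER BY event_time DESC) AS rn
        FROM raw_api_data          -- external Hive table over the CSV folder
        WHERE id IS NOT NULL       -- filter broken rows
    ) t
    WHERE rn = 1;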

The goal now is to select the data from this table and insert them into
another Hive/Presto table. For this I still need to join the data from the
landing table with the data in the destination table, to ensure that I'm
not inserting duplicate records; I do this with Presto. Once that is done,
I need to move the files from the landing folder to another folder, to
empty/truncate the external table. This is where my problem resides, as I'm
struggling to find an elegant way to trigger this file move only if the SQL
step is successful.
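
That deduplicating insert is essentially an anti-join; a rough sketch,
again with made-up names:

    -- Insert only the rows that are not already in the destination table.
    INSERT INTO target_orc_table
    SELECT s.id, s.payload, s.event_time
    FROM clean_api_data s
    LEFT JOIN target_orc_table t ON t.id = s.id
    WHERE t.id IS NULL;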

I had a look at ListHDFS, but this processor, like GetHDFS, cannot be
triggered by an incoming flowfile (only scheduled, by CRON of course). As I
need to accumulate enough data in this table to correlate records, I cannot
build the flow around individual files, which rules out FetchHDFS. Direct
conversion to ORC doesn't work either, as I have to filter/deduplicate a
lot of information from the raw API data.

The last idea we had was to insert line by line into a landing table that
would act like the current folder, but with more flexibility, as we
wouldn't have to take care of the CSV files. Unfortunately this doesn't
work with PutHiveQL, as we have around 10,000-100,000 rows to insert per
minute and the PutHiveQL processor cannot keep up (we get 1-2 inserts per
second). We tried the HiveStreaming processor, but despite our efforts we
were unable to make it work with Kerberos and HDP 2.5 (the NiFi processor
seems to require a "hive-metastore" principal that we don't have, and when
we created it we still ran into Kerberos issues). Our last test was to use
the Presto Teradata JDBC driver with the PutSQL processor, but that doesn't
work either, as the driver runs in auto-commit mode, which is incompatible
with the processor.
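
One thing we have not tried yet would be to batch many rows into a single
multi-row INSERT (Hive supports INSERT ... VALUES since 0.14), so that each
PutHiveQL statement carries a whole batch instead of one row. A sketch with
made-up names; no idea yet whether it would keep up:

    -- Hypothetical landing table (id INT, payload STRING, event_time STRING).
    -- The VALUES list would be assembled upstream in NiFi, one batch per flowfile.
    INSERT INTO landing_table VALUES
        (1, 'payload-a', '2017-04-05 10:00:00'),
        (2, 'payload-b', '2017-04-05 10:00:01'),
        (3, 'payload-c', '2017-04-05 10:00:02');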

I'm currently trying to design a better flow that works around these
limitations, and will maybe try to use a SQL database as a buffer instead.
This should give me a better way to control when to truncate the table. Of
course, any thoughts/recommendations are appreciated.
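
Concretely, that buffer pattern would boil down to something like this (a
rough sketch with made-up names, assuming an RDBMS buffer such as
PostgreSQL; the TRUNCATE would only be sent after the insert succeeds, e.g.
routed off the success relationship of the processor):

    -- Step 1, against the buffer DB (e.g. via ExecuteSQL): pull the batch.
    SELECT id, payload, event_time FROM buffer_table;

    -- Step 2, against Presto: deduplicated insert of that batch into the
    -- destination table (as in the anti-join sketch above).

    -- Step 3, against the buffer DB, only if step 2 succeeded:
    TRUNCATE TABLE buffer_table;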

Thanks!

On Wed, Apr 5, 2017 at 10:16 PM, Matt Burgess <[email protected]> wrote:

> Arnaud,
>
> Can you explain more about what you'd like to do via an INSERT query?
> Are you trying to accomplish #3 using Hive via JDBC?  If so you should
> be able to use PutHiveQL rather than PutSQL. If you already have an
> external table in Hive and don't yet have the ORC table, you should be
> able to use a CREATE TABLE AS (CTAS) statement [1] in PutHiveQL.  If
> the ORC table exists and you want to insert from the external table,
> you can use INSERT INTO/OVERWRITE [2].  Apologies if I misunderstood
> what you are trying to do, if that's the case can you please
> elaborate?
>
> Per your comment that you can't trigger GetHDFS, consider using
> ListHDFS [3] and/or FetchHDFS [4] instead. If you know which files you
> want (from the flow), you don't need ListHDFS, rather you'd just set
> the filename attribute on the flow and route it to FetchHDFS.  Having
> said that, if you are already pulling the content of the HDFS files
> into NiFi, perhaps consider the ConvertAvroToORC [5] processor (if you
> can easily get your incoming data into Avro). This would allow you to
> convert to ORC within NiFi, then you can use PutHDFS to land the files
> on Hadoop, then PutHiveQL to create a table on top of the directory
> containing the ORC files.  If that is overkill, hopefully the
> PutHiveQL with the CTAS or INSERT statements will suffice.
>
> Regards,
> Matt
>
> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableAsSelect(CTAS)
> [2] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
> [3] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.ListHDFS/index.html
> [4] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hadoop.FetchHDFS/index.html
> [5] https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.hive.ConvertAvroToORC/index.html
>
>
> On Wed, Apr 5, 2017 at 8:45 AM, Arnaud G <[email protected]> wrote:
> > Hi,
> >
> > I'm currently building a flow in NiFi and I'm trying to find the best
> > way to do it in a reliable manner.
> >
> > The setup is the following:
> >
> > 1) Some files are copied into a folder in HDFS
> > 2) A Hive external table points to this directory
> > 3) The data of this table are then copied into an ORC table
> > 4) The data from the folder are archived and compressed in another folder
> >
> > My first issue is that I cannot easily trigger an INSERT SQL query from
> > NiFi. The ExecuteSQL processor only executes SELECT queries, not INSERT
> > queries. I can of course select all the data, bring them back into NiFi
> > and then use PutSQL, but as the data are going to be copied as-is, this
> > doesn't bring any value.
> > My current solution is to rely on an external Python script (using JDBC
> > from there) and use ExecuteStreamCommand to trigger the insert from the
> > external table. It is not very elegant, but it seems to work.
> >
> > Now I have to ensure that the SQL query is successful before moving the
> > file to another folder, otherwise I will end up with inconsistent data.
> > I'm currently using GetHDFS/PutHDFS to move files around, however it is
> > not possible to trigger the GetHDFS processor.
> >
> > What would be the best strategy to move the HDFS file only if a previous
> > event is successful? Any recommendations?
> >
> > Thanks for your help!
> >
> > Regards,
