Just curious: Why is it necessary to identify that global state using catalog,
schema, table or column names? Those are database concepts. As a user, I'd
prefer to have some flexibility here. Surely the component needs to know how to
identify the database object, but beyond that, we should be free to choose how
to identify that "global"state ourselves. Would that be feasible, maybe as an
optional attribute that by default would be set with the database.table.column
I have another example, this time regarding the "while" attribute of
QueryDatabaseTable. Simply put, it's not enough. I have data that for some
reason was originally inserted out of order into a source table (or the
database just decided to show it like that), and without an "order by" clause I
can't avoid having gaps in the data I'm trying to extract from it now. Not a
total blocker since I could use a little SQL injection to work around the
limitation, but it should not be like that. What if one needs a more elaborate
"select", for example? I would be far preferable to trust the user and let them
inform the whole SQL command, of course with placeholders so NiFi could
properly set the values under its control. An optional attribute that would
take precedence over the "while" when informed would be enough. I'd expect that
kind of thing to be not so hard to implement, and it wold make the processor
much more powerful for more advanced users without complicating the lives of
those who are happy with a simpler configuration.
In general, when dealing with SQL and databases, the less assumptions we make
about how the component is going to be used, the better. SQL is very powerful,
and its power should be fully available to this kind of processor, IMHO.
What do you think?
On Tuesday, March 13, 2018 8:56 PM, Matt Burgess <mattyb...@apache.org>
Originally, we stored state only based on column names, which could obviously
cause problems when you have two different tables with the same column name.
However this was because the original DB Fetch processors (QueryDatabaseTable,
GenerateTableFetch, e.g.) did not accept incoming flow files and thus acted on
a single table. Then we added the ability for GenerateTableFetch to accept
incoming flow files, and realized we should be storing state at least based on
table name + column name, since GTF might get multiple tables in. I believe
you are running into the issue where we'd need to qualify the state based on
database name + table name + column name, and please feel free to write up a
Jira for that improvement. The cutoff at table name was a tradeoff against
complexity, as a database might not be fully-qualified by its name either
(imagine multiple hosts with replicated DBs inside, then how do we know that
two hosts don't point at the same place?).
For your use case, I think we'd need to store state by the aforementioned
"better-qualified" key, but that might be the limit to our name qualification.
We will have to deal with backwards-compatibility as we did before we added
table names, but since we have precedence I wouldn't think it would be too
difficult to implement.
As a workaround, you might try swapping QueryDatabaseTable with
GenerateTableFetch, and trying to distribute the flow files for a particular DB
to a particular instance of ExecuteSQL to actually fetch the rows. You should
be able to use RouteOnAttribute for this, assuming your table name is an
attribute on the flow file.
On Tue, Mar 13, 2018 at 8:30 PM, Ramaninder Singh Jhajj
I am facing an issue with QueryDatabaseTable processor.
I have 4 identical mySQL database tables on 4 different AWS instances. The
structure is same but data is different. What I am trying to do is, have 4
QueryDatabaseTable processors and fetch the data from all the 4 instances to
process it further and store in elasticsearch.
This is the structure of the flow, now my issue is:When any one of the
processors run, it stores the "Maximum-value Columns" value in the state as
shown below and this state is global in the cluster.
Now when second QueryDatabaseTable processor runs, it overwrites the state
value written by the first. I am facing issue with maintaining state for all 4
processors. Processors runs fine without any issue but obviously, the data
being fetches is not consistent as "id" column values gets overwritten in the
Is there any solution to this problem. I need to do incremental fetch to all 4
identical tables on 4 instances in a single flow and single cluster.
Please let me know if anyone faced similar problem and if there is any solution