Originally, we stored state only based on column names, which could
obviously cause problems when you have two different tables with the same
column name. However this was because the original DB Fetch processors
(QueryDatabaseTable, GenerateTableFetch, e.g.) did not accept incoming flow
files and thus acted on a single table. Then we added the ability for
GenerateTableFetch to accept incoming flow files, and realized we should be
storing state at least based on table name + column name, since GTF might
get multiple tables in. I believe you are running into the issue where
we'd need to qualify the state based on database name + table name + column
name, and please feel free to write up a Jira for that improvement. The
cutoff at table name was a tradeoff against complexity, as a database might
not be fully-qualified by its name either (imagine multiple hosts with
replicated DBs inside, then how do we know that two hosts don't point at
the same place?).
For your use case, I think we'd need to store state by the aforementioned
"better-qualified" key, but that might be the limit to our name
qualification. We will have to deal with backwards-compatibility as we did
before we added table names, but since we have precedence I wouldn't think
it would be too difficult to implement.
As a workaround, you might try swapping QueryDatabaseTable with
GenerateTableFetch, and trying to distribute the flow files for a
particular DB to a particular instance of ExecuteSQL to actually fetch the
rows. You should be able to use RouteOnAttribute for this, assuming your
table name is an attribute on the flow file.
On Tue, Mar 13, 2018 at 8:30 PM, Ramaninder Singh Jhajj <
> Hello Everyone,
> I am facing an issue with QueryDatabaseTable processor.
> I have 4 identical mySQL database tables on 4 different AWS instances. The
> structure is same but data is different. What I am trying to do is, have 4
> QueryDatabaseTable processors and fetch the data from all the 4 instances
> to process it further and store in elasticsearch.
> This is the structure of the flow, now my issue is:
> *When any one of the processors run, it stores the "Maximum-value Columns"
> value in the state as shown below and this state is global in the cluster. *
> *Now when second QueryDatabaseTable processor runs, it overwrites the
> state value written by the first. I am facing issue with maintaining state
> for all 4 processors. Processors runs fine without any issue but obviously,
> the data being fetches is not consistent as "id" column values gets
> overwritten in the state.*
> Is there any solution to this problem. I need to do incremental fetch to
> all 4 identical tables on 4 instances in a single flow and single cluster.
> Please let me know if anyone faced similar problem and if there is any
> solution for this.
> Kind Regards,