Hi Joe

Thanks for that clarification. We have three more questions:

1. Is NiFi going to have an Ambari Metrics API for collecting monitoring data? (A rough sketch of the kind of bridge we are imagining follows below.)
2. Will the designer and the executor be separated from each other?
3. Where can I find demos/examples for each processor?
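To make question 1 concrete, here is a minimal Python sketch of the bridge we have in mind. The NiFi status endpoint, the Ambari Metrics Collector endpoint, the host names, and the payload shape are all assumptions on our part, not confirmed APIs:

    # Hypothetical bridge: poll NiFi's REST API for flow statistics and push
    # them to the Ambari Metrics Collector. Endpoint paths and payload shape
    # are assumptions, not a confirmed integration.
    import time
    import requests

    NIFI_STATUS_URL = "http://nifi-host:8080/nifi-api/flow/status"      # assumed path
    AMS_COLLECTOR_URL = "http://ams-host:6188/ws/v1/timeline/metrics"   # assumed path

    def poll_and_push():
        status = requests.get(NIFI_STATUS_URL, timeout=10).json()["controllerStatus"]
        now_ms = int(time.time() * 1000)
        payload = {"metrics": [{
            "metricname": "nifi.activeThreadCount",
            "appid": "nifi",
            "hostname": "nifi-host",
            "starttime": now_ms,
            "metrics": {str(now_ms): status["activeThreadCount"]},
        }]}
        requests.post(AMS_COLLECTOR_URL, json=payload, timeout=10).raise_for_status()

    while True:
        poll_and_push()
        time.sleep(30)  # one sample every 30 seconds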
Thank you very much
Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
14/03/2016
----Original Message----
From: Joe Witt <[email protected]>
To: users <[email protected]>
Cc: (none)
Sent: 2016-03-14 01:14:40
Subject: Re: Re: Re: Multiple dataflow jobs management (lots of jobs)

To clarify about 'HA and master node' - that is for the control plane itself. The data continues to flow on all nodes even if the NCM is down. That said, we are working to solve it now with zero-master clustering.

Thanks
Joe

On Sun, Mar 13, 2016 at 12:20 PM, 刘岩 wrote:
Hi Thad

Thank you very much for your advice. Kettle can do the job for sure, but the metadata I was talking about is the metadata of the job descriptions used by Kettle itself. The only option left for Kettle is multiple instances, but that also means we would need to develop a master application to gather all the instances' metadata.

Moreover, Kettle does not have a web-based GUI for designing and testing the jobs; that's why we want NiFi. But again, multiple instances of NiFi also lead to an HA problem for the master node, so we turned to Ambari Metrics for that issue.

Talend has a cloud server doing a similar thing, but it's running on a public cloud, which is not acceptable to our client.

Kettle is a great ETL tool, but a web-based designer is really the key point for the future.

Thank you very much

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
14/03/2016
----Original Message----
From: Thad Guidry
To: users
Cc: dev
Sent: 2016-03-13 23:04:39
Subject: Re: Re: Multiple dataflow jobs management (lots of jobs)

Yan,
Pentaho Kettle (PDI) can also certainly handle your needs, but using 10K jobs to accomplish this is not the proper way to set up Pentaho. Also, using MySQL to store the metadata is where you made a wrong choice. PostgreSQL with data silos on SSD drives would be a better choice, while properly doing async config [1] and the other steps necessary for high write volumes. Don't keep Pentaho's table output commit level at its default of 10K rows when you're processing millions of rows! For Oracle 11g or PostgreSQL, where I need 30-second time-slice windows for the metadata logging and typically have less than 1 KB of data on average per row, I will typically choose 200K rows or more in Pentaho's table output commit option.

I would suggest you contact Pentaho for some ad hoc support or hire some consultants to help you learn more, or set up properly for your use case. For free, you can also just do a web search on "Pentaho best practices". There's a lot to learn from industry experts who have already used these tools and know their quirks.

[1] http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR
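As a rough illustration of the commit-interval point, here is a minimal Python sketch assuming psycopg2 and a made-up staging table; it shows the batching idea behind a large table output commit setting, not Pentaho's actual implementation:

    # Batch commits every 200K rows instead of every 10K. The table, columns,
    # and connection string are made up for illustration.
    import psycopg2

    COMMIT_EVERY = 200_000  # rows per commit, per the advice above

    conn = psycopg2.connect("dbname=staging user=etl")
    cur = conn.cursor()
    # Session-level cousin of the async tuning in [1]: trade a small
    # durability window for much higher write throughput.
    cur.execute("SET synchronous_commit = off")

    def load(rows):
        for i, row in enumerate(rows, start=1):
            cur.execute(
                "INSERT INTO job_metadata (job_id, logged_at, payload) "
                "VALUES (%s, %s, %s)", row)
            if i % COMMIT_EVERY == 0:
                conn.commit()
        conn.commit()  # flush the final partial batch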
Thad
+ThadGuidry

On Sat, Mar 12, 2016 at 11:00 AM, 刘岩 wrote:
Hi Aldrin

Some additional information.

It's a typical ETL offloading use case.

Each extraction job should focus on one table and one table only. Data will be written to HDFS; this is similar to database staging.

The reason we need to focus on one table per job is that a database error or disconnection may occur during the extraction. If it's running as one script-like extraction job with expression language, then it's hard to re-run or skip just the affected table or tables.

Once the extraction is done, a trigger-like action will do the data cleansing; this is similar to the ODS layer of data warehousing.

If the data has passed the quality check, it will be marked as cleaned. Otherwise, it will return to the previous step and redo the data extraction, or send an alert/email to the system administrator.

Once a certain number of tables are all cleaned and checked, some transforming processors will be called to do the transformation and push the data into a data warehouse (Hive in our case).
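Purely to make that per-table flow concrete, a minimal Python sketch follows; every step is a made-up stub standing in for the real extraction, cleansing, and alerting logic:

    # Sketch of the per-table pipeline described above; all helpers are
    # hypothetical stand-ins, not real implementations.
    def extract_to_hdfs(table): print(f"extracting {table} to staging")
    def cleanse(table): print(f"cleansing {table} (ODS layer)")
    def quality_check(table): return True   # pretend the check passed
    def mark_cleaned(table): print(f"{table} marked cleaned")
    def alert_admin(table): print(f"ALERT: {table} failed quality checks")
    def transform_and_load_to_hive(tables): print(f"loading {tables} into Hive")

    MAX_ATTEMPTS = 3  # illustrative retry budget per table

    def process_table(table):
        for _ in range(MAX_ATTEMPTS):
            extract_to_hdfs(table)       # one table per job, staged on HDFS
            cleanse(table)               # trigger-like step after extraction
            if quality_check(table):
                mark_cleaned(table)
                return True
            # failed check: redo the extraction on the next iteration
        alert_admin(table)               # exhausted retries: notify the admin
        return False

    def run(tables):
        cleaned = [t for t in tables if process_table(t)]
        if len(cleaned) == len(tables):  # batch fully cleaned and checked
            transform_and_load_to_hive(cleaned)

    run(["customers", "orders"])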
Thank you very much

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
13/03/2016

----Original Message----
发件人:"刘岩" >> 收件人:users >> 抄 送: dev >> 发送时间:2016-03-13 00:12:27>> 主题:Re:Re:
Multiple dataflow jobs management(lots of jobs)>>>>>> Hi Aldrin>>>> Currently
Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means we need to run jobs concurrently, and we need a general picture of what's going on with all those 60K job flows so we can take further actions.

We have tried Kettle and Talend. Talend is IDE-based, so it is not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; it is really the product we are looking for, but the missing piece here is a dataflow jobs admin page, so that we can have multiple NiFi instances running on different nodes while monitoring the jobs on one page. If it can integrate with the Ambari Metrics API, then we can develop an Ambari View for NiFi jobs monitoring, just like the HDFS View and Hive View.

Thank you very much
Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
06/03/2016

----Original Message----
From: Aldrin Piri
To: users
Cc: dev
Sent: 2016-03-11 02:27:11
Subject: Re: Multiple dataflow jobs management (lots of jobs)

Hi Yan,

We can get more into details and particulars if needed, but have you experimented with expression language [1]? I could see a Cron-driven approach covering your periodic efforts that feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with), each receiving a table. This would certainly cut down on the need for 30K processors on a one-to-one basis between tables and processors.
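A plain-Python sketch of that idea (not NiFi itself): one scheduled loop feeding table names to a single parameterized query, the analogue of a lone ExecuteSQL processor whose SQL property uses expression language such as SELECT * FROM ${table.name}. The table names and the sqlite source are illustrative stand-ins for the real Oracle connections:

    # One query template, parameterized per table, instead of one processor
    # per table. Table names and the in-memory database are made up.
    import sqlite3

    TABLES = ["customers", "orders", "payments"]  # hypothetical table names

    def extract_all(conn):
        for table in TABLES:
            # The analogue of ${table.name} in the processor's SQL property.
            rows = conn.execute(f"SELECT * FROM {table}").fetchall()
            print(f"{table}: staged {len(rows)} rows")  # stand-in for the HDFS write

    conn = sqlite3.connect(":memory:")
    for t in TABLES:
        conn.execute(f"CREATE TABLE {t} (id INTEGER)")  # demo tables only
    extract_all(conn)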
In terms of monitoring the dataflows, could you describe what else you are looking for beyond the graph view? NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring solution; we can give information on a per-processor basis but do not delve into specifics. There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask [2]) and pull (REST API [3]) semantics.
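On the pull side, a minimal sketch of a one-page summary across several instances, assuming the REST API in [3]; the exact endpoint path and field names may vary by NiFi version:

    # Hypothetical poller for a single-page view over several NiFi nodes:
    # pull each instance's flow status via the REST API and print one line
    # per node. Hosts, endpoint path, and field names are assumptions.
    import requests

    NODES = ["http://nifi-1:8080", "http://nifi-2:8080"]  # illustrative hosts

    for base in NODES:
        status = requests.get(f"{base}/nifi-api/flow/status", timeout=10).json()
        cs = status["controllerStatus"]
        print(f"{base}: {cs['activeThreadCount']} active threads, "
              f"{cs['flowFilesQueued']} flowfiles queued")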
Any other details beyond your list about how this all interoperates might shed some more light on what you are trying to accomplish. It seems like NiFi should be able to help with this. With some additional information we may be able to provide further guidance, or at least gain some insights on use cases we could look to improve upon and extend NiFi to support.
Thanks!

[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html

On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 wrote:

Hi All
I'm trying to adopt NiFi in production but cannot find an admin console for monitoring the dataflows.

The scenario is simple:

1. We gather data from Oracle databases to HDFS and then to Hive.
2. Residuals/incrementals are updated daily or monthly via NiFi.
3. Full dumps of some tables are executed daily or monthly via NiFi.

It really is simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario.

This means I would have to drag the ExecuteSQL element some 30K times, and also need to place them in a nice-looking way on my little 21-inch screen.

Just wondering if there is a table-like, groupable and searchable task control and monitoring feature for NiFi.

Thank you very much in advance

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
06/03/2016