Hi Joe

Thanks for that clarification. We have three more questions:

1. Is NiFi going to have an Ambari Metrics API for collecting monitoring data? (A rough sketch of the kind of bridge we are imagining follows below.)
2. Will the designer and the executor be separated from each other?
3. Where can I find demos/examples for each processor?
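To make question 1 concrete, here is a minimal Python sketch of the bridge we have in mind. The NiFi status endpoint, the Ambari Metrics Collector endpoint, the host names, and the payload shape are all assumptions on our part, not confirmed APIs:

    # Hypothetical bridge: poll NiFi's REST API for flow statistics and push
    # them to the Ambari Metrics Collector. Endpoint paths and payload shape
    # are assumptions, not a confirmed integration.
    import time
    import requests

    NIFI_STATUS_URL = "http://nifi-host:8080/nifi-api/flow/status"      # assumed path
    AMS_COLLECTOR_URL = "http://ams-host:6188/ws/v1/timeline/metrics"   # assumed path

    def poll_and_push():
        status = requests.get(NIFI_STATUS_URL, timeout=10).json()["controllerStatus"]
        now_ms = int(time.time() * 1000)
        payload = {"metrics": [{
            "metricname": "nifi.activeThreadCount",
            "appid": "nifi",
            "hostname": "nifi-host",
            "starttime": now_ms,
            "metrics": {str(now_ms): status["activeThreadCount"]},
        }]}
        requests.post(AMS_COLLECTOR_URL, json=payload, timeout=10).raise_for_status()

    while True:
        poll_and_push()
        time.sleep(30)  # one sample every 30 seconds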
Thank you very much
Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
14/03/2016
----Original Message----
From: Joe Witt <[email protected]>
To: users <[email protected]>
Cc: (none)
Sent: 2016-03-14 01:14:40
Subject: Re: Re: Re: Multiple dataflow jobs management (lots of jobs)

To clarify about 'HA and master node' - that is for the control plane itself. The data continues to flow on all nodes even if the NCM is down. That said, we are working to solve it now with zero-master clustering.

Thanks
Joe

On Sun, Mar 13, 2016 at 12:20 PM, 刘岩 wrote:
Hi Thad

Thank you very much for your advice. Kettle can do the job for sure, but the metadata I was talking about is the metadata of the job descriptions used by Kettle itself. The only option left for Kettle is multiple instances, but that also means we would need to develop a master application to gather all the instances' metadata.

Moreover, Kettle does not have a web-based GUI for designing and testing the jobs; that's why we want NiFi. But again, multiple instances of NiFi also lead to an HA problem for the master node, so we turned to Ambari Metrics for that issue.

Talend has a cloud server doing a similar thing, but it's running on a public cloud, which is not acceptable to our client.

Kettle is a great ETL tool, but a web-based designer is really the key point for the future.

Thank you very much

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
14/03/2016
----Original Message----
From: Thad Guidry
To: users
Cc: dev
Sent: 2016-03-13 23:04:39
Subject: Re: Re: Multiple dataflow jobs management (lots of jobs)

Yan,
Pentaho Kettle (PDI) can also certainly handle your needs, but using 10K jobs to accomplish this is not the proper way to set up Pentaho. Also, using MySQL to store the metadata is where you made a wrong choice. PostgreSQL with data silos on SSD drives would be a better choice, while properly doing async config [1] and the other steps necessary for high write volumes. Don't keep Pentaho's table output commit level at its default of 10K rows when you're processing millions of rows! For Oracle 11g or PostgreSQL, where I need 30-second time-slice windows for the metadata logging and typically have less than 1 KB of data on average per row, I will typically choose 200K rows or more in Pentaho's table output commit option.

I would suggest you contact Pentaho for some ad hoc support or hire some consultants to help you learn more, or set up properly for your use case. For free, you can also just do a web search on "Pentaho best practices". There's a lot to learn from industry experts who have already used these tools and know their quirks.

[1] http://www.postgresql.org/docs/9.5/interactive/runtime-config-resource.html#RUNTIME-CONFIG-RESOURCE-ASYNC-BEHAVIOR
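As a rough illustration of the commit-interval point, here is a minimal Python sketch assuming psycopg2 and a made-up staging table; it shows the batching idea behind a large table output commit setting, not Pentaho's actual implementation:

    # Batch commits every 200K rows instead of every 10K. The table, columns,
    # and connection string are made up for illustration.
    import psycopg2

    COMMIT_EVERY = 200_000  # rows per commit, per the advice above

    conn = psycopg2.connect("dbname=staging user=etl")
    cur = conn.cursor()
    # Session-level cousin of the async tuning in [1]: trade a small
    # durability window for much higher write throughput.
    cur.execute("SET synchronous_commit = off")

    def load(rows):
        for i, row in enumerate(rows, start=1):
            cur.execute(
                "INSERT INTO job_metadata (job_id, logged_at, payload) "
                "VALUES (%s, %s, %s)", row)
            if i % COMMIT_EVERY == 0:
                conn.commit()
        conn.commit()  # flush the final partial batch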
Thad
+ThadGuidry

On Sat, Mar 12, 2016 at 11:00 AM, 刘岩 wrote:
Hi Aldrin

Some additional information.

It's a typical ETL offloading use case.

Each extraction job should focus on one table and one table only. Data will be written to HDFS; this is similar to database staging.

The reason we need to focus on one table per job is that a database error or disconnection may occur during the extraction. If it's running as one script-like extraction job with expression language, then it's hard to re-run or skip just the affected table or tables.

Once the extraction is done, a trigger-like action will do the data cleansing; this is similar to the ODS layer of data warehousing.

If the data has passed the quality check, it will be marked as cleaned. Otherwise, it will return to the previous step and redo the data extraction, or send an alert/email to the system administrator.

Once a certain number of tables are all cleaned and checked, some transforming processors will be called to do the transformation and push the data into a data warehouse (Hive in our case).
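Purely to make that per-table flow concrete, a minimal Python sketch follows; every step is a made-up stub standing in for the real extraction, cleansing, and alerting logic:

    # Sketch of the per-table pipeline described above; all helpers are
    # hypothetical stand-ins, not real implementations.
    def extract_to_hdfs(table): print(f"extracting {table} to staging")
    def cleanse(table): print(f"cleansing {table} (ODS layer)")
    def quality_check(table): return True   # pretend the check passed
    def mark_cleaned(table): print(f"{table} marked cleaned")
    def alert_admin(table): print(f"ALERT: {table} failed quality checks")
    def transform_and_load_to_hive(tables): print(f"loading {tables} into Hive")

    MAX_ATTEMPTS = 3  # illustrative retry budget per table

    def process_table(table):
        for _ in range(MAX_ATTEMPTS):
            extract_to_hdfs(table)       # one table per job, staged on HDFS
            cleanse(table)               # trigger-like step after extraction
            if quality_check(table):
                mark_cleaned(table)
                return True
            # failed check: redo the extraction on the next iteration
        alert_admin(table)               # exhausted retries: notify the admin
        return False

    def run(tables):
        cleaned = [t for t in tables if process_table(t)]
        if len(cleaned) == len(tables):  # batch fully cleaned and checked
            transform_and_load_to_hive(cleaned)

    run(["customers", "orders"])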
Thank you very much

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
13/03/2016

----Original Message----
发件人:"刘岩" >> 收件人:users >> 抄 送: dev >> 发送时间:2016-03-13 00:12:27>> 主题:Re:Re:
Multiple dataflow jobs management(lots of jobs)>>>>>> Hi Aldrin>>>> Currently
Currently we need to extract 60K tables per day, and the time window is limited to 8 hours. This means we need to run jobs concurrently, and we need a general picture of what's going on with all those 60K job flows so we can take further actions.

We have tried Kettle and Talend. Talend is IDE-based, so it is not what we are looking for, and Kettle crashed because MySQL could not handle Kettle's metadata with 10K jobs.

So we want to use NiFi; it is really the product we are looking for, but the missing piece here is a dataflow jobs admin page, so that we can have multiple NiFi instances running on different nodes while monitoring the jobs on one page. If it can integrate with the Ambari Metrics API, then we can develop an Ambari View for NiFi jobs monitoring, just like the HDFS View and Hive View.

Thank you very much
Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
06/03/2016

----Original Message----
From: Aldrin Piri
To: users
Cc: dev
Sent: 2016-03-11 02:27:11
Subject: Re: Multiple dataflow jobs management (lots of jobs)

Hi Yan,

We can get more into details and particulars if needed, but have you experimented with expression language [1]? I could see a Cron-driven approach covering your periodic efforts that feeds some number of ExecuteSQL processors (perhaps one for each database you are communicating with), each receiving a table. This would certainly cut down on the need for 30K processors on a one-to-one basis between tables and processors.
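A plain-Python sketch of that idea (not NiFi itself): one scheduled loop feeding table names to a single parameterized query, the analogue of a lone ExecuteSQL processor whose SQL property uses expression language such as SELECT * FROM ${table.name}. The table names and the sqlite source are illustrative stand-ins for the real Oracle connections:

    # One query template, parameterized per table, instead of one processor
    # per table. Table names and the in-memory database are made up.
    import sqlite3

    TABLES = ["customers", "orders", "payments"]  # hypothetical table names

    def extract_all(conn):
        for table in TABLES:
            # The analogue of ${table.name} in the processor's SQL property.
            rows = conn.execute(f"SELECT * FROM {table}").fetchall()
            print(f"{table}: staged {len(rows)} rows")  # stand-in for the HDFS write

    conn = sqlite3.connect(":memory:")
    for t in TABLES:
        conn.execute(f"CREATE TABLE {t} (id INTEGER)")  # demo tables only
    extract_all(conn)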
In terms of monitoring the dataflows, could you describe what else you are looking for beyond the graph view? NiFi tries to provide context for the flow of data but is not trying to be a sole monitoring solution; we can give information on a per-processor basis but do not delve into specifics. There is a summary view for the overall flow where you can monitor stats about the components and connections in the system. We support interoperation with monitoring systems via push (ReportingTask [2]) and pull (REST API [3]) semantics.
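On the pull side, a minimal sketch of a one-page summary across several instances, assuming the REST API in [3]; the exact endpoint path and field names may vary by NiFi version:

    # Hypothetical poller for a single-page view over several NiFi nodes:
    # pull each instance's flow status via the REST API and print one line
    # per node. Hosts, endpoint path, and field names are assumptions.
    import requests

    NODES = ["http://nifi-1:8080", "http://nifi-2:8080"]  # illustrative hosts

    for base in NODES:
        status = requests.get(f"{base}/nifi-api/flow/status", timeout=10).json()
        cs = status["controllerStatus"]
        print(f"{base}: {cs['activeThreadCount']} active threads, "
              f"{cs['flowFilesQueued']} flowfiles queued")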
Any other details beyond your list about how this all interoperates might shed some more light on what you are trying to accomplish. It seems like NiFi should be able to help with this. With some additional information we may be able to provide further guidance, or at least gain some insights on use cases we could look to improve upon and extend NiFi to support.
Thanks!

[1] http://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html
[2] http://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#reporting-tasks
[3] http://nifi.apache.org/docs/nifi-docs/rest-api/index.html

On Sat, Mar 5, 2016 at 9:25 PM, 刘岩 wrote:

Hi All
I'm trying to adopt NiFi in production but cannot find an admin console for monitoring the dataflows.

The scenario is simple:

1. We gather data from Oracle databases to HDFS and then to Hive.
2. Residuals/incrementals are updated daily or monthly via NiFi.
3. Full dumps of some tables are executed daily or monthly via NiFi.

It really is simple; however, we have 7 Oracle databases with over 30K tables that need to implement the above scenario.

This means I would have to drag the ExecuteSQL element some 30K times, and also need to place them in a nice-looking way on my little 21-inch screen.

Just wondering if there is a table-like, groupable and searchable task control and monitoring feature for NiFi.

Thank you very much in advance

Yan Liu
Hortonworks Service Division
Richinfo, Shenzhen, China (PR)
06/03/2016