Re: [VOTE] Bring Griffin to Apache Incubator
+1 (non-binding)

Sent from my iPhone

> On Dec 2, 2016, at 9:00 AM, Kasper Sørensen wrote:
>
> +1 (binding)
>
> 2016-12-01 17:58 GMT-08:00 Julian Hyde:
>
>> +1 (binding)
>>
>>> On Dec 1, 2016, at 3:30 PM, Liang Chen wrote:
>>>
>>> Hi
>>>
>>> +1 (non-binding)
>>>
>>> Regards
>>> Liang
>>>
>>> Henry Saputra wrote:
>>>
>>> Hi All,
>>>
>>> As the champion for Griffin, I would like to start a VOTE to bring the
>>> project into the Apache Incubator as a podling.
>>>
>>> Here is a direct quote from the abstract:
>>>
>>> "Griffin is a Data Quality Service platform built on Apache Hadoop and
>>> Apache Spark. It provides a framework for defining data quality models,
>>> executing data quality measurements, and automating data profiling and
>>> validation, as well as unified data quality visualization across
>>> multiple data systems. It addresses the data quality challenges of big
>>> data and streaming contexts."
>>>
>>> Please cast your vote:
>>>
>>> [ ] +1, bring Griffin into the Incubator
>>> [ ] +0, I don't care either way
>>> [ ] -1, do not bring Griffin into the Incubator, because...
>>>
>>> This vote will be open for at least 72 hours, and only votes from the
>>> Incubator PMC are binding. The VOTE will end 12/5, 9am PST, to allow
>>> for the weekend.
>>>
>>> Here is the link to the proposal:
>>> https://wiki.apache.org/incubator/GriffinProposal
>>>
>>> I have copied the proposal below for easy access.
>>>
>>> Thanks,
>>> - Henry
>>>
>>> Griffin Proposal
>>>
>>> Abstract
>>>
>>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>>> Apache Spark. It provides a framework for defining data quality models,
>>> executing data quality measurements, and automating data profiling and
>>> validation, as well as unified data quality visualization across
>>> multiple data systems. It addresses the data quality challenges of big
>>> data and streaming contexts.
>>>
>>> Proposal
>>>
>>> Griffin is an open source Data Quality solution for distributed data
>>> systems at any scale, in both streaming and batch contexts. When people
>>> use open source products (e.g. Apache Hadoop, Apache Spark, Apache
>>> Kafka, Apache Storm), they need a data quality service to build
>>> confidence in the data those platforms process. Griffin creates a
>>> unified process to define and construct data quality measurement
>>> pipelines across multiple data systems, providing:
>>>
>>> - Automatic quality validation of the data
>>> - Data profiling and anomaly detection
>>> - Data quality lineage from upstream to downstream data systems
>>> - Data quality health monitoring visualization
>>> - Shared infrastructure resource management
>>>
>>> Overview of Griffin
>>>
>>> Griffin has been deployed in production at eBay, serving major data
>>> systems. It takes a platform approach, providing generic features that
>>> solve common data quality validation pain points. First, the user
>>> registers the data asset to be checked. The data asset can be batch
>>> data in an RDBMS (e.g. Teradata) or an Apache Hadoop system, or
>>> near-real-time streaming data from Apache Kafka, Apache Storm, and
>>> other real-time data platforms. Second, the user creates a data quality
>>> model defining the data quality rules and metadata. Third, the model or
>>> rule is executed automatically (by the model engine); for streaming
>>> data, sample validation results are available within a few seconds.
>>> Finally, the user analyzes the data quality results through the
>>> built-in visualization tool and takes action.
>>>
>>> Griffin includes:
>>>
>>> Data Quality Model Engine
>>>
>>> Griffin is a model-driven solution: the user chooses data quality
>>> dimensions and executes validation against a selected target data set,
>>> using a source data set as the golden reference. A corresponding
>>> back-end library supports the following measurements:
>>>
>>> - Accuracy: does the data reflect the real-world objects or a
>>>   verifiable source?
>>> - Completeness: is all necessary data present?
>>> - Validity: are all data values within the data domains specified by
>>>   the business?
>>> - Timeliness: is the data available at the time needed?
>>> - Anomaly detection: pre-built algorithms that identify items, events,
>>>   or observations that do not conform to an expected pattern or to
>>>   other items in a dataset
>>> - Data Profiling: statistical analysis and assessment of data values
>>>   within a dataset for consistency, uniqueness, and logic
>>>
>>> Data Collection Layer
>>>
>>> We support two kinds of data sources: batch data and real-time data.
>>> For batch mode, we collect source data from Apache Hadoop-based
>>> platforms through various data connectors. For real-time mode, we
>>> connect to messaging systems such as Kafka for near-real-time analysis.
>>>
>>> Data Process and Storage Layer
>>>
>>> For batch analysis, our data quality model computes data quality
>>> metrics in our Spark cluster, based on the source data in Apache
>>> Hadoop. For near-real-time analysis, we consume data from the messaging
>>> system, and our data quality model computes real-time data quality
>>> metrics in our Spark cluster. For storage, we use a time series
>>> database in our back end to serve front-end requests.
>>>
>>> Griffin Service
>>>
>>> We have RESTful web services that accomplish all of Griffin's
>>> functionality, such as registering data assets, creating data quality
>>> models, publishing metrics, …
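The accuracy dimension described in the proposal (checking a target data set against a golden source data set) can be sketched as follows. This is a minimal illustration in plain Python, not Griffin's actual engine or API: in Griffin the comparison runs as a Spark job over registered data assets, and the function name, record layout, and `key` field here are hypothetical.

```python
# Illustrative sketch (NOT Griffin's real API): an accuracy measure as
# defined above -- the fraction of target records that exactly match the
# corresponding record in the "golden" source data set, keyed by record id.

def accuracy(source_rows, target_rows, key="id"):
    """Return (matched, total, ratio) for target rows vs. golden source rows."""
    golden = {row[key]: row for row in source_rows}
    total = len(target_rows)
    matched = sum(
        1 for row in target_rows
        if golden.get(row[key]) == row  # record exists and all fields agree
    )
    return matched, total, (matched / total if total else 1.0)

# Hypothetical example data: one target record ("id": 2) drifted from the source.
source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}, {"id": 3, "amt": 30}]
target = [{"id": 1, "amt": 10}, {"id": 2, "amt": 99}, {"id": 3, "amt": 30}]
print(accuracy(source, target))  # prints (2, 3, 0.6666666666666666)
```

In a real deployment this per-record comparison would be distributed (e.g. a join on the key followed by a field-wise equality check), but the metric reported is the same matched/total ratio.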
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

2016-12-01 17:58 GMT-08:00 Julian Hyde:
> +1 (binding)
>
>> On Dec 1, 2016, at 3:30 PM, Liang Chen wrote:
>>
>> Hi
>>
>> +1 (non-binding)
>>
>> Regards
>> Liang
>>
>> Henry Saputra wrote:
>>> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

> On Dec 1, 2016, at 3:30 PM, Liang Chen wrote:
>
> Hi
>
> +1 (non-binding)
>
> Regards
> Liang
>
> Henry Saputra wrote:
>> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
Hi

+1 (non-binding)

Regards
Liang

Henry Saputra wrote:
> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra wrote:
> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

Regards,
Uma

On 11/30/16, 10:40 PM, "Henry Saputra" wrote:
> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

On Thu, Dec 1, 2016 at 8:47 AM, Andrew Purtell wrote:

> +1 (binding)
>
>> On Dec 1, 2016, at 8:35 AM, Felix Cheung wrote:
>>
>> +1
>>
>> On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra wrote:
>>
>>> Hi All,
>>>
>>> As the champion for Griffin, I would like to start a VOTE to bring the
>>> project in as an Apache Incubator podling.
>>>
>>> Here is the direct quote from the abstract:
>>>
>>> "
>>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>>> Apache Spark. It provides a framework for defining data quality models,
>>> executing data quality measurements, and automating data profiling and
>>> validation, as well as unified data quality visualization across
>>> multiple data systems. It addresses the data quality challenges of big
>>> data and streaming contexts.
>>> "
>>>
>>> Please cast your vote:
>>>
>>> [ ] +1, bring Griffin into the Incubator
>>> [ ] +0, I don't care either way
>>> [ ] -1, do not bring Griffin into the Incubator, because...
>>>
>>> This vote will be open for at least 72 hours, and only votes from the
>>> Incubator PMC are binding.
>>>
>>> The VOTE will end 12/5 9am PST to allow the weekend to pass.
>>>
>>> Here is the link to the proposal:
>>>
>>> https://wiki.apache.org/incubator/GriffinProposal
>>>
>>> I have copied the proposal below for easy access.
>>>
>>> Thanks,
>>>
>>> - Henry
>>>
>>>
>>> Griffin Proposal
>>>
>>> Abstract
>>>
>>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>>> Apache Spark. It provides a framework for defining data quality models,
>>> executing data quality measurements, and automating data profiling and
>>> validation, as well as unified data quality visualization across
>>> multiple data systems. It addresses the data quality challenges of big
>>> data and streaming contexts.
>>>
>>> Proposal
>>>
>>> Griffin is an open source Data Quality solution for distributed data
>>> systems at any scale, in both streaming and batch contexts. When people
>>> use open source products (e.g. Apache Hadoop, Apache Spark, Apache
>>> Kafka, Apache Storm), they need a data quality service to build
>>> confidence in the data processed by those platforms. Griffin creates a
>>> unified process to define and construct data quality measurement
>>> pipelines across multiple data systems, providing:
>>>
>>> Automatic quality validation of the data
>>> Data profiling and anomaly detection
>>> Data quality lineage from upstream to downstream data systems
>>> Data quality health monitoring and visualization
>>> Shared infrastructure resource management
>>>
>>> Overview of Griffin
>>>
>>> Griffin has been deployed in production at eBay, serving major data
>>> systems. It takes a platform approach, providing generic features to
>>> solve common data quality validation pain points. First, the user
>>> registers the data asset to be checked. The data asset can be batch
>>> data in an RDBMS (e.g. Teradata) or an Apache Hadoop system, or
>>> near-real-time streaming data from Apache Kafka, Apache Storm, and
>>> other real-time data platforms. Second, the user creates a data quality
>>> model to define the data quality rules and metadata. Third, the model
>>> or rule is executed automatically (by the model engine); for streaming
>>> data, sample data quality validation results arrive within a few
>>> seconds. Finally, the user analyzes the data quality results through
>>> the built-in visualization tool and takes action.
>>>
>>> Griffin includes:
>>>
>>> Data Quality Model Engine
>>>
>>> Griffin is a model-driven solution: the user chooses among various data
>>> quality dimensions to run validation against a selected target data set
>>> and a source data set (the golden reference data). A corresponding
>>> back-end library supports the following measurements:
>>>
>>> Accuracy - Does the data reflect the real-world objects or a
>>> verifiable source?
>>> Completeness - Is all necessary data present?
>>> Validity - Are all data values within the data domains specified by
>>> the business?
>>> Timeliness - Is the data available at the time needed?
>>> Anomaly detection - Pre-built algorithms for identifying items,
>>> events, or observations that do not conform to an expected pattern or
>>> to other items in a dataset
>>> Data Profiling - Statistical analysis and assessment of data values
>>> within a dataset for consistency, uniqueness, and logic
>>>
>>> Data Collection Layer
>>>
>>> We support two kinds of data sources: batch data and real-time data.
>>>
>>> For batch mode, we collect data from Apache Hadoop based platforms
>>> through various data connectors.
>>>
>>> For real-time mode, we connect with messaging systems like Kafka for
>>> near-real-time analysis.
>>>
>>> Data Process and Storage Layer
>>>
>>> For batch analysis, our data quality model computes data quality
>>> metrics in our Spark cluster based on data sources in Apache Hadoop.
>>>
>>> For near-real-time analysis, we consume data from the messaging
>>> system, and our data quality model computes the real-time data quality
>>> metrics in our Spark cluster. For data storage, we use a time series
>>> database in the back end to serve front-end requests.
>>>
>>> Griffin Service
>>>
>>> We have RESTful web services to accomplish all the functionalities of
>>> Griffin.
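The "Accuracy" measurement described in the proposal (comparing a target data set against a golden-source reference) can be sketched in a few lines. This is an illustrative stand-in, not Griffin's actual API: the record layout and the `id` matching key are invented for the example, and a real deployment would express the same join-and-compare logic as a Spark job over Hadoop or Kafka data.

```python
# Hypothetical sketch of an "accuracy" measure: the fraction of target
# records that exactly match their golden-source counterpart by key.
# This is NOT Griffin's API; field names and the key are assumptions.

def accuracy(source, target, key="id"):
    """Return matched/total for target records vs. the source of truth."""
    truth = {rec[key]: rec for rec in source}
    matched = sum(1 for rec in target if truth.get(rec[key]) == rec)
    return matched / len(target) if target else 1.0

source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
target = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]  # one mismatch
print(accuracy(source, target))  # 0.5
```

In Griffin's terms, the resulting ratio would be emitted as a data quality metric and stored in the back-end time series database for visualization.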
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

> On Dec 1, 2016, at 8:35 AM, Felix Cheung wrote:
>
> +1
>
> On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra wrote:
>
>> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1

On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra wrote:

> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
Of course my Vote +1 (binding)

On Wed, Nov 30, 2016 at 10:40 PM Henry Saputra wrote:

> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (binding)

On 1 December 2016 at 06:40, Henry Saputra wrote:

> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1. Data quality is critical to the success of a data-driven business. This is a good initiative to draw people's attention to ensuring data quality.

From: Hao Chen
Sent: December 1, 2016 6:01:15 AM
To: general@incubator.apache.org
Subject: Re: [VOTE] Bring Griffin to Apache Incubator

+1 (non-binding), Griffin looks like a very valuable project to fill the challenging gap of data quality solutions in the big data domain.

- Hao

On Thu, Dec 1, 2016 at 2:40 PM, Henry Saputra wrote:

> [Griffin proposal snipped; quoted in full earlier in the thread]
Re: [VOTE] Bring Griffin to Apache Incubator
+1 (non-binding), Griffin looks like a very valuable project to fill the challenging gap of data quality solution in big data domain.

- Hao

On Thu, Dec 1, 2016 at 2:40 PM, Henry Saputra wrote:

> [Griffin proposal snipped; quoted in full earlier in the thread]