RE: Few Questions about Griffin

2018-04-07 Thread Lionel, Liu
Hi Vinod,

For the first question, that sounds like the validity dimension: measuring
data items against defined rules. The validity dimension has not been
implemented in griffin yet, but you can achieve the same thing through
profiling for now. For example, you can define a profiling rule like "select
count(*) from source where length(telephone) = 10 and name is not null", which
will produce the count of items matching the rule. With another metric for the
total count, you can then compute the percentage yourself. In fact, getting
the count metrics is more flexible than getting the percentage directly.
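As a rough sketch, the dq config for such a profiling measure could look like
the following (the key names follow the measure-configuration guide and may
differ between versions; the hive database/table and file names are made up):

# Hedged sketch of a batch dq config for validity-style profiling. Keys follow
# the measure-configuration guide of this era; table names are placeholders.
cat > dq-validity.json <<'EOF'
{
  "name": "validity-profiling",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": { "database": "default", "table.name": "users" }
        }
      ]
    }
  ],
  "evaluateRule": {
    "rules": [
      {
        "dsl.type": "spark-sql",
        "name": "valid_count",
        "rule": "SELECT COUNT(*) AS valid_count FROM source WHERE length(telephone) = 10 AND name IS NOT NULL"
      },
      {
        "dsl.type": "spark-sql",
        "name": "total_count",
        "rule": "SELECT COUNT(*) AS total_count FROM source"
      }
    ]
  }
}
EOF

Dividing valid_count by total_count downstream gives the validity percentage.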
For the second question, I'm not very familiar with Kerberos, but at eBay we
also use an hdfs cluster with Kerberos authentication. The griffin measure
module runs as a spark application and supports all the spark parameters, so
it should work the same way as any other spark application you submit on your
cluster. If that's not correct, please tell me, thanks.
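For instance, on a YARN cluster the standard Kerberos options of spark-submit
should apply. A hedged sketch (the main class, jar and config file names below
are placeholders; please check the start scripts shipped with your build):

# --principal/--keytab are standard spark-submit options on YARN; the class
# name, jar and config paths here are illustrative, not the exact Griffin ones.
spark-submit \
  --class org.apache.griffin.measure.Application \
  --master yarn --deploy-mode cluster \
  --principal griffin@EXAMPLE.COM \
  --keytab /path/to/griffin.keytab \
  measure.jar env.json dq-validity.json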

Thanks
Lionel, Liu

From: Vinod Raina
Sent: April 5, 2018 13:09
To: Lionel Liu; dev@griffin.incubator.apache.org
Cc: Karan Gupta
Subject: RE: Few Questions about Griffin

Thank you Lionel,
I have 2 more follow-up queries:
1. My requirement is to check data quality in terms of whether the data
conforms to the data types I expect. E.g. one column may contain a telephone
number, so I expect it to be a 10-digit number; another column is a birthdate,
so I expect it to be in a date format; or there is a name column and I don't
want it to be null/missing. So I need to create a metric report where I can
see the percentage of data that conforms to the validations we have created.
Can griffin do that?
2. Also, our HDFS is a Kerberised cluster. Can griffin work on a Kerberised
cluster?



Regards
Vinod Raina | vinod.ra...@tavant.com
Associate Technical Architect
M: +91 9711022965


Re: Few Questions about Griffin

2018-04-03 Thread Lionel Liu
Hi Vinod,

We're glad to receive your email. There are some other documents about Griffin
listed below:
wiki: https://cwiki.apache.org/confluence/display/GRIFFIN/Apache+Griffin
github: https://github.com/apache/incubator-griffin/tree/master/griffin-doc
And you can follow
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/docker/griffin-docker-guide.md
to try the griffin docker image.

For your questions, I'll list my answers:

*1. What is the usage of the accuracy metric? In what situations will it be
useful?*

Accuracy measures the match percentage between two data sources, which we call
"source" and "target": "source" is the data source you trust, and "target" is
the data source you want to check.
For example, say "source" is [1, 2, 3, 4, 5] while "target" is [1, 3, 5, 7,
9]. The accuracy is #(target items matched in source) / #(all target items) =
3/5 = 60%. Actually, "exact match" is a narrow concept; in accuracy we say
"pass the match rule", and users can define their own match rule, like
"source.age <= target.age AND upper(source.city) = upper(target.city)",
instead of an exact match.
When we have a data source we trust, we let it be the "source"; then we can
measure the accuracy of another data source, the "target", to figure out how
far we can trust it.
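In the measure config, such a match rule sits in an accuracy rule entry. A
hedged sketch of that fragment (key names follow the measure-configuration
guide and may vary by version):

# Fragment destined for the "rules" array of the dq config's evaluate rule;
# written to a scratch file here only for illustration.
cat > accuracy-rule-fragment.json <<'EOF'
{
  "dsl.type": "griffin-dsl",
  "dq.type": "accuracy",
  "rule": "source.age <= target.age AND upper(source.city) = upper(target.city)"
}
EOF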

There's a standard use case:
In our data pipeline, when we get users' data from the site, we persist it as
table T1, which we trust as the source of truth. On the other hand, a copy of
the users' data is pushed to some streaming or batch processes; after some
steps, the processed data is persisted as table T2, and we want to know how
correct it is, or how much we can trust it.
Set T1 as "source" and T2 as "target", and we can get the accuracy of T2, with
the wrong items from T2 persisted.

And another, more specific use case:
We have a streaming data processing system that consumes data from an input
and produces to an output. Each output data item also contains the key of its
input item, and we want to know how much of the data is successfully
processed.
Set the output as "source" and the input as "target", and we can get the
accuracy of the input, with the missing items from the input persisted.
Actually, this case measures the completeness of the output, but it works like
reversed accuracy, so we can use it this way.

However, in the griffin measure configuration, the concepts of source and
target are based on the code implementation, which differs from the business
concept above. In the measure configuration documents, we're measuring the
accuracy of "source".
We are planning to modify the code implementation to align with the business
concept later; when we do, we'll highlight it in the release notes.


*2. Can we run other metrics using the command line? (or) Is only the accuracy
metric supported at the moment?*

Yes, you can run the griffin measure module from the command line directly,
like this:
https://github.com/bhlx3lyx7/griffin-docker/blob/master/svc_msr_new/prep/measure/start-accu.sh
At present the griffin UI module doesn't support all the dimensions, but the
measure module supports accuracy, profiling, timeliness and uniqueness; you
can find descriptions of them here:
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/dsl-guide.md#griffin-dsl-translation-to-sql
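The start script boils down to an ordinary spark-submit. Roughly (class name,
queue and file names are illustrative; see the linked start-accu.sh for the
real values):

# Mirrors the shape of start-accu.sh, with placeholder class/jar/config names.
spark-submit \
  --class org.apache.griffin.measure.Application \
  --master yarn-client \
  --queue default \
  griffin-measure.jar env.json config.json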


*3. Project roadmap for features?*

The project roadmap was out of date; we've updated it:
https://cwiki.apache.org/confluence/display/GRIFFIN/0.+Roadmap
Some new features we're planning in the short term:
- streaming measure job scheduling.
- support for more data quality dimensions, such as completeness, consistency
  and validity.
And for the long term, maybe:
- support for more data sources, such as RDBs and elasticsearch.
- anomaly detection support.
- spark 2 support.


*4. Can we create custom rules and profile existing data?*

Yes, you can create custom rules for your data, according to the documents:
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/measure-configuration-guide.md
and
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/measure-batch-sample.md
The profiling rule supports simple spark-sql syntax directly, as described at
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/dsl-guide.md#profiling
If you want to use spark-sql itself, you can also define the rules like this:
https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure/dsl-guide.md#spark-sql
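For a flavor of the two styles, here is a hedged sketch of the same profiling
idea written both ways (the cntry and id columns are made up; please check the
dsl-guide for the exact syntax):

# Two alternative rule entries for the "rules" array: one griffin-dsl, one
# spark-sql. Written to a scratch file here only for illustration.
cat > profiling-rule-fragments.json <<'EOF'
[
  {
    "dsl.type": "griffin-dsl",
    "dq.type": "profiling",
    "rule": "source.cntry, source.id.count() group by source.cntry"
  },
  {
    "dsl.type": "spark-sql",
    "name": "cnt_by_cntry",
    "rule": "SELECT cntry, COUNT(id) AS cnt FROM source GROUP BY cntry"
  }
]
EOF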


*5. Postgresql and mysql -- both listed in Prerequisites. We have MySQL; is
that enough?*

In fact, you can choose either one of postgresql and mysql.
We used mysql for measure and schedule persistence before, but due to a
license issue with the release, we have had to switch to postgresql recently.
If you want to use mysql, you need to modify some dependencies in the service
module and the application.properties file, and rebuild service.jar as well.
We are going to add a document to help users set up mysql or another db.
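The datasource part of application.properties would change along these lines.
A hedged sketch with made-up credentials (standard Spring Boot datasource
keys; the exact properties depend on the service module's setup):

# Placeholder database name, user and password; adjust to your environment.
cat > application.properties.mysql <<'EOF'
spring.datasource.url=jdbc:mysql://localhost:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username=griffin
spring.datasource.password=secret
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
EOF
# Also swap the postgresql dependency for mysql-connector-java in the service
# module's pom.xml, then rebuild service.jar.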


Hope this helps; please feel free to ask if you have any further questions.