Hi Randy,
Thanks for your email. As I understand it, your requirement is a daily
report of missing data and anomaly detection, is that right?
*1. For missing data, Griffin could create an accuracy measure, which
measures the match percentage between the source-of-truth data and the
target data, and persists the missing records.* The configuration would
look like this:
{
  "name": "accuracy",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/normalized_data"
          }
        }
      ]
    },
    {
      "name": "target",
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/raw_data"
          },
          "pre.proc": [
            {
              "dsl.type": "df-ops",
              "rule": "decode"
            }
          ]
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "source.id = target.id AND source.timestamp = target.timestamp AND source.currency = target.currency"
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
}
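For intuition, the accuracy rule above is roughly what a left join on the
three key columns computes. Here is a minimal Spark sketch in Scala (the
view names "source" and "target" and the percentage math are illustrative,
not Griffin's internal implementation):

import org.apache.spark.sql.SparkSession

// Assumes the two datasets are already registered as temp views
// named "source" and "target" (illustrative setup, not Griffin code).
val spark = SparkSession.builder().appName("accuracy-sketch").getOrCreate()

// Records in source that have no match in target are the "missing" ones.
val missing = spark.sql(
  """SELECT source.* FROM source
    |LEFT JOIN target
    |  ON source.id = target.id
    | AND source.timestamp = target.timestamp
    | AND source.currency = target.currency
    |WHERE target.id IS NULL""".stripMargin)

val total = spark.table("source").count()
val missCount = missing.count()
val matchedPercent = (total - missCount).toDouble / total * 100
// Griffin reports the match percentage and persists the missing
// records to the configured sinks.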
*2. For anomaly detection, I'm not sure whether the value bounds in your
case are static. For static bounds, a profiling measure could handle this;
dynamic bounds are not supported in the current version.* For the
profiling measure, the configuration would look like this:
{
  "name": "profiling",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/normalized_data"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "rule": "count(value) as total_count from source"
      },
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "rule": "count(value) as anomaly_count from source where value < 0 or value > 100000"
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
}
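For reference, the two profiling rules map to a single Spark SQL
aggregation like this (a sketch in Scala, assuming the normalized data is
registered as a temp view named "source"; the bounds 0 and 100000 are just
the static example values from above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("profiling-sketch").getOrCreate()

// Count all values, and the values falling outside the static bounds.
val counts = spark.sql(
  """SELECT count(value) AS total_count,
    |       count(CASE WHEN value < 0 OR value > 100000 THEN value END) AS anomaly_count
    |FROM source""".stripMargin)
counts.show()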
*3. I notice that your data source is on S3, which is not supported at the
moment. However, it would not be difficult to add, since Spark can read
data from S3 directly.*
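For example, reading the bucket directly with Spark would look like this
(a sketch; the credential settings and path are placeholders, and it
assumes the hadoop-aws package is on the classpath):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-read-sketch").getOrCreate()

// Placeholder credentials; in practice prefer IAM roles or instance profiles.
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

// Spark reads S3 paths directly through the s3a filesystem connector.
val df = spark.read.parquet("s3a://bucket_name/path/to/normalized_data")
df.show(5)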
Hope this helps. If you would like to leverage Griffin, I would need more
details about your requirements.
Sincerely looking forward to your feedback.
Thanks,
Lionel Liu
On Tue, Aug 7, 2018 at 6:57 PM William Guo <[email protected]> wrote:
> + to users mailing list.
>
>
> ---------- Forwarded message ----------
> From: Randy Myers <[email protected]>
> Date: Tue, Aug 7, 2018 at 5:46 AM
> Subject: Re: Hello from griffin team
> To: William Guo <[email protected]>
>
>
> Hi William,
>
> My company tickPredict wants to collect market data from cryptocurrency
> exchanges (Coinbase, Kraken, Binance, and others). This data is available
> from the exchanges via public feeds. In most cases, the data comes via
> websocket, but in a few, we have to use REST.
>
> We will timestamp this data when we receive the message on our AWS cloud
> instance. We will be saving both raw data (sent directly to an S3 bucket)
> and "normalized data". This normalized data will aggregate the various
> feeds into one table structure. We will make this normalized data
> available to Python/Pandas and R.
>
> One of our clients mentioned Griffin as a possible solution for anomaly
> detection and reporting. The client would like to have a daily report of
> missing data or anomalies, which might be values that are out of bounds of
> reality.
>
> Is Griffin something we could use for these purposes?
>
> Thanks!
>
> Randy Myers
> Linkedin.com/in/RandyMy
> 312.933.6476 m
>
>
> On Fri, Aug 3, 2018 at 5:15 PM, William Guo <[email protected]> wrote:
>
>> Sure,
>>
>> Looking forward to it.
>>
>>
>> Thanks,
>> William
>>
>> On Sat, Aug 4, 2018 at 12:58 AM, Randy Myers <[email protected]>
>> wrote:
>>
>>> Thanks for the note, William. I will get you some details of our
>>> project over the weekend.
>>>
>>> Randy Myers
>>> Linkedin.com/in/RandyMy
>>> 312.933.6476 m
>>>
>>> On Fri, Aug 3, 2018 at 9:22 AM, William Guo <[email protected]> wrote:
>>>
>>>> Hello Randy,
>>>>
>>>> This is William from griffin team, Vivian told us your message.
>>>>
>>>> We are pleased to support you.
>>>>
>>>> Could you tell us your use cases for data quality so we can follow your
>>>> insights better?
>>>>
>>>>
>>>> Thanks,
>>>> William
>>>>