Hi Randy,

Thanks for your email. As I understand it, you need a daily report of
missing data and anomalies, is that right?

*1. For missing data, Griffin can create an accuracy measure, which
computes the match percentage between the source-of-truth data and the
target data, and persists the missing records.* The configuration would
look like this:

{
  "name": "accuracy",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/normalized_data"
          }
        }
      ]
    }, {
      "name": "target",
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/raw_data"
          },
          "pre.proc": [
            {
              "dsl.type": "df-ops",
              "rule": "decode"
            }
          ]
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "source.id = target.id AND source.timestamp =
target.timestamp AND source.currency = target.currency"
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
}
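
Conceptually, that accuracy rule counts the source records which have no
matching record in the target. Here is a rough Spark SQL sketch of what it
computes (illustrative only; Griffin builds the actual plan internally, and
it assumes the two data sources are registered as temp views named "source"
and "target", matching the config above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("accuracy-sketch").getOrCreate()

// Miss records: source rows with no match in target on the rule's keys.
val missRecords = spark.sql("""
  SELECT source.*
  FROM source LEFT JOIN target
    ON source.id = target.id
   AND source.timestamp = target.timestamp
   AND source.currency = target.currency
  WHERE target.id IS NULL
""")

// Match percentage = (total source count - miss count) / total source count.
val total = spark.table("source").count()
val matched = total - missRecords.count()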

*2. For anomaly detection, I'm not sure whether the value bounds in your
case are static or dynamic. Static bounds can be handled by a profiling
measure, while dynamic bounds are not supported in the current version.*
For the profiling measure, the configuration would look like this:

{
  "name": "profiling",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "source",
      "baseline": true,
      "connectors": [
        {
          "type": "S3",
          "version": "",
          "config": {
            "file.name": "s3a://bucket_name/path/to/normalized_data"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "rule": "count(value) as total_count from source"
      },
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "profiling",
        "rule": "count(value) as anomaly_count from source where value < 0
or value > 100000"
      }
    ]
  },
  "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
}
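
For reference, those two profiling rules are essentially plain SQL
aggregations. A minimal sketch of the equivalent queries (assuming "source"
is registered as a temp view, and using 0 / 100000 as placeholder bounds
for your real static limits):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("profiling-sketch").getOrCreate()

// Total records vs. records outside the static bounds.
val totalCount = spark.sql(
  "SELECT COUNT(value) AS total_count FROM source")
val anomalyCount = spark.sql(
  "SELECT COUNT(value) AS anomaly_count FROM source WHERE value < 0 OR value > 100000")

A daily job could compare anomaly_count against total_count to flag
out-of-bounds values; dynamic bounds would have to be computed outside the
measure for now.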

*3. I notice that your data source is on S3, which is not supported in the
current version. However, it should not be difficult to add, since Spark
can read data from S3 directly.*
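
For reference, here is a minimal sketch of Spark reading from S3 (assuming
the raw data is JSON and that hadoop-aws is on the classpath; the bucket
and path are the placeholders from the configs above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-read-sketch").getOrCreate()

// Credentials come from the AWS default provider chain or the
// fs.s3a.access.key / fs.s3a.secret.key Hadoop settings.
val rawData = spark.read.json("s3a://bucket_name/path/to/raw_data")
rawData.createOrReplaceTempView("raw_data")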

Hope this helps. If you would like to leverage Griffin, I would need more
details about your requirements.
Sincerely looking forward to your feedback.


Thanks,
Lionel, Liu

On Tue, Aug 7, 2018 at 6:57 PM William Guo <[email protected]> wrote:

> + to users mailing list.
>
>
> ---------- Forwarded message ----------
> From: Randy Myers <[email protected]>
> Date: Tue, Aug 7, 2018 at 5:46 AM
> Subject: Re: Hello from griffin team
> To: William Guo <[email protected]>
>
>
> Hi William,
>
> My company tickPredict wants to collect market data from cryptocurrency
> exchanges (Coinbase, Kraken, Binance, and others).  This data is
> available from the exchanges via a public feed.  In most cases, the data
> is via websocket, but in a few, we have to use REST.
>
> We will timestamp this data when we receive the message at our AWS cloud
> instance.  We will be saving both raw data (sent directly to an S3 bucket)
> and "normalized data".  This normalized data will aggregate the various
> feeds into one table structure.  We will make this normalized data
> available to Python/Pandas and R.
>
> One of our clients mentioned Griffin as a possible solution for anomaly
> detection and reporting.  The client would like to have a daily report of
> missing data or anomalies, which might be values that are out of bounds of
> reality.
>
> Is Griffin something we could use for these purposes?
>
> Thanks!
>
> Randy Myers
> Linkedin.com/in/RandyMy
> 312.933.6476 m
>
>
> On Fri, Aug 3, 2018 at 5:15 PM, William Guo <[email protected]> wrote:
>
>> Sure,
>>
>> Looking forward to it.
>>
>>
>> Thanks,
>> William
>>
>> On Sat, Aug 4, 2018 at 12:58 AM, Randy Myers <[email protected]>
>> wrote:
>>
>>> Thanks for the note, William.  I will get you some details of our
>>> project over the weekend.
>>>
>>> Randy Myers
>>> Linkedin.com/in/RandyMy
>>> 312.933.6476 m
>>>
>>> On Fri, Aug 3, 2018 at 9:22 AM, William Guo <[email protected]> wrote:
>>>
>>>> Hello Randy,
>>>>
>>>> This is William from griffin team, Vivian told us your message.
>>>>
>>>> We are pleased to support you.
>>>>
>>>> Could you tell us your use cases for data quality so we can follow your
>>>> insights better.
>>>>
>>>>
>>>> Thanks,
>>>> William
>>>>
>>>>
>>>>
>>>
>>
>
>
