Yes,

we need to review our module and support as many data sources as Spark
supports.

If Spark can generate DataFrames/Datasets from those sources, we should be
fine.
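
For example (a minimal sketch with a placeholder path, just to illustrate
what "generate a DataFrame from a source" means here):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("source-check")
  .master("local[*]")
  .getOrCreate()

// Any format Spark can load this way should be usable by our module.
val df = spark.read.format("json").load("path/to/data.json")
df.printSchema()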

Thanks,
William

On Tue, Aug 7, 2018 at 8:10 PM, Lionel Liu <[email protected]> wrote:

> Hi Randy,
>
> Thanks for your email. To my understanding, your requirement is daily
> missing-data detection and anomaly detection, is that right?
>
> *1. For missing data, Griffin could create an accuracy measure, which
> measures the match percentage between the source-of-truth data and the
> target data, and persists the missing data.* The configuration would look
> like this:
>
> {
>   "name": "accuracy",
>   "process.type": "batch",
>   "data.sources": [
>     {
>       "name": "source",
>       "baseline": true,
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/normalized_data"
>           }
>         }
>       ]
>     }, {
>       "name": "target",
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/raw_data"
>           },
>           "pre.proc": [
>             {
>               "dsl.type": "df-ops",
>               "rule": "decode"
>             }
>           ]
>         }
>       ]
>     }
>   ],
>   "evaluate.rule": {
>     "rules": [
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "accuracy",
>         "rule": "source.id = target.id AND source.timestamp =
> target.timestamp AND source.currency = target.currency"
>       }
>     ]
>   },
>   "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
> }
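>
> As an illustration of what the accuracy rule above amounts to (a minimal
> sketch, not Griffin's actual internal code; only the column names id,
> timestamp and currency come from the rule):
>
> import org.apache.spark.sql.DataFrame
>
> // Rows in source with no matching row in target are the missing data.
> def missingRecords(source: DataFrame, target: DataFrame): DataFrame =
>   source.join(target, Seq("id", "timestamp", "currency"), "left_anti")
>
> // Matched percentage = (total - missing) / total.
> def matchedPercentage(source: DataFrame, missing: DataFrame): Double = {
>   val total = source.count()
>   if (total == 0) 100.0
>   else (total - missing.count()).toDouble / total * 100
> }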
>
> *2. For anomaly detection, I'm not sure whether the value bounds in your
> case are static or not. For static bounds, a profiling measure can handle
> this; dynamic bounds are not supported in the current version.* For the
> profiling measure, the configuration would look like this:
>
> {
>   "name": "profiling",
>   "process.type": "batch",
>   "data.sources": [
>     {
>       "name": "source",
>       "baseline": true,
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/normalized_data"
>           }
>         }
>       ]
>     }
>   ],
>   "evaluate.rule": {
>     "rules": [
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "profiling",
>         "rule": "count(value) as total_count from source"
>       },
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "profiling",
>         "rule": "count(value) as anomaly_count from source where value < 0
> or value > 100000"
>       }
>     ]
>   },
>   "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
> }
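>
> For reference, the two profiling rules above express roughly the following
> plain Spark SQL (a sketch, assuming the S3 data has been registered as a
> temp view named "source"):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("profiling-sketch").getOrCreate()
>
> spark.sql("SELECT count(value) AS total_count FROM source").show()
> spark.sql(
>   "SELECT count(value) AS anomaly_count FROM source " +
>   "WHERE value < 0 OR value > 100000").show()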
>
> *3. I notice that your data source is on S3, which is not supported at the
> moment. However, it should not be difficult to handle, since Spark can
> read data from S3.*
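>
> For example, a minimal sketch of reading that bucket with Spark's s3a
> connector (the path matches the config above; the credentials handling
> here is a placeholder, not part of Griffin):
>
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().appName("s3-read").getOrCreate()
>
> // On AWS, credentials can also come from the instance profile.
> spark.sparkContext.hadoopConfiguration.set(
>   "fs.s3a.aws.credentials.provider",
>   "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
>
> val raw = spark.read.json("s3a://bucket_name/path/to/raw_data")
> raw.show()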
>
> Hope this helps. If you would like to leverage Griffin, I will need more
> details about your requirements.
> Sincerely looking forward to your feedback.
>
>
> Thanks,
> Lionel Liu
>
> On Tue, Aug 7, 2018 at 6:57 PM William Guo <[email protected]> wrote:
>
>> + to users mailing list.
>>
>>
>> ---------- Forwarded message ----------
>> From: Randy Myers <[email protected]>
>> Date: Tue, Aug 7, 2018 at 5:46 AM
>> Subject: Re: Hello from griffin team
>> To: William Guo <[email protected]>
>>
>>
>> Hi William,
>>
>> My company tickPredict wants to collect market data from cryptocurrency
>> exchanges (Coinbase, Kraken, Binance, and others).  This data is available
>> from the exchanges via public feeds.  In most cases, the data comes via
>> websocket, but in a few we have to use REST.
>>
>> We will timestamp this data when we receive the message on our AWS cloud
>> instance.  We will be saving both raw data (sent directly to an S3 bucket)
>> and "normalized data".  This normalized data will aggregate the various
>> feeds into one table structure.  We will make this normalized data
>> available to Python/Pandas and R.
>>
>> One of our clients mentioned Griffin as a possible solution for anomaly
>> detection and reporting.  The client would like to have a daily report of
>> missing data or anomalies, which might be values that are out of bounds of
>> reality.
>>
>> Is Griffin something we could use for these purposes?
>>
>> Thanks!
>>
>> Randy Myers
>> Linkedin.com/in/RandyMy
>> 312.933.6476 m
>>
>>
>> On Fri, Aug 3, 2018 at 5:15 PM, William Guo <[email protected]> wrote:
>>
>>> Sure,
>>>
>>> Looking forward to it.
>>>
>>>
>>> Thanks,
>>> William
>>>
>>> On Sat, Aug 4, 2018 at 12:58 AM, Randy Myers <[email protected]>
>>> wrote:
>>>
>>>> Thanks for the note, William.  I will get you some details of our
>>>> project over the weekend.
>>>>
>>>> Randy Myers
>>>> Linkedin.com/in/RandyMy
>>>> 312.933.6476 m
>>>>
>>>> On Fri, Aug 3, 2018 at 9:22 AM, William Guo <[email protected]> wrote:
>>>>
>>>>> Hello Randy,
>>>>>
>>>>> This is William from the Griffin team; Vivian passed us your message.
>>>>>
>>>>> We are pleased to support you.
>>>>>
>>>>> Could you tell us your use cases for data quality, so we can follow
>>>>> your insights better?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> William
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
