Yes, we need to review our module and support as many data sources as Spark does.
If Spark can generate DataFrames/Datasets from those sources, we should be fine.

Thanks,
William
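That gating condition can be sanity-checked directly. A minimal Scala sketch, assuming the placeholder s3a path from Lionel's example below, JSON-lines data, and that hadoop-aws plus AWS credentials are configured (all assumptions, not Griffin code):

    import org.apache.spark.sql.SparkSession

    object S3ReadCheck {
      def main(args: Array[String]): Unit = {
        // Assumes hadoop-aws is on the classpath and credentials are set
        // (e.g. via fs.s3a.access.key / fs.s3a.secret.key).
        val spark = SparkSession.builder().appName("s3-read-check").getOrCreate()

        // Placeholder path from the thread; format assumed to be JSON lines.
        val df = spark.read.json("s3a://bucket_name/path/to/normalized_data")

        df.printSchema()                 // if this loads, Griffin can consume it
        println(s"rows: ${df.count()}")
        spark.stop()
      }
    }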
On Tue, Aug 7, 2018 at 8:10 PM, Lionel Liu <[email protected]> wrote:

> Hi Randy,
>
> Thanks for your email. To my understanding, your requirement is a daily
> report of missing data, plus anomaly detection -- is that right?
>
> *1. For missing data, griffin could create an accuracy measure, which
> measures the match percentage between the source-of-truth data and the
> target data, and persists the missing data.* The configuration would be
> like this:
>
> {
>   "name": "accuracy",
>   "process.type": "batch",
>   "data.sources": [
>     {
>       "name": "source",
>       "baseline": true,
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/normalized_data"
>           }
>         }
>       ]
>     }, {
>       "name": "target",
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/raw_data"
>           },
>           "pre.proc": [
>             {
>               "dsl.type": "df-ops",
>               "rule": "decode"
>             }
>           ]
>         }
>       ]
>     }
>   ],
>   "evaluate.rule": {
>     "rules": [
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "accuracy",
>         "rule": "source.id = target.id AND source.timestamp = target.timestamp AND source.currency = target.currency"
>       }
>     ]
>   },
>   "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
> }
>
> *2. For anomaly detection, I'm not sure whether the bounds on a value are
> static in your case. Static bounds can be handled by a profiling measure,
> while dynamic bounds are not supported in the current version.* For the
> profiling measure, the configuration would be like this:
>
> {
>   "name": "profiling",
>   "process.type": "batch",
>   "data.sources": [
>     {
>       "name": "source",
>       "baseline": true,
>       "connectors": [
>         {
>           "type": "S3",
>           "version": "",
>           "config": {
>             "file.name": "s3a://bucket_name/path/to/normalized_data"
>           }
>         }
>       ]
>     }
>   ],
>   "evaluate.rule": {
>     "rules": [
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "profiling",
>         "rule": "count(value) as total_count from source"
>       },
>       {
>         "dsl.type": "griffin-dsl",
>         "dq.type": "profiling",
>         "rule": "count(value) as anomaly_count from source where value < 0 or value > 100000"
>       }
>     ]
>   },
>   "sinks": ["CONSOLE", "HDFS", "ELASTICSEARCH"]
> }
>
> *3. I notice that your data source is on S3, which is not currently
> supported. However, it will not be difficult to handle, since Spark can
> read data from S3.*
>
> Hope this helps. I will need more details about your requirements if you
> would like to leverage griffin.
> Sincerely looking forward to your feedback.
>
>
> Thanks,
> Lionel, Liu
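To make the accuracy measure above concrete: the rule is essentially counting source rows with no match in the target on the three join keys. A hypothetical Scala sketch of that computation in plain Spark -- paths, JSON format, and column names are taken from the example config, the "decode" pre-processing step is skipped, and this is not Griffin's actual implementation:

    import org.apache.spark.sql.SparkSession

    object AccuracySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("accuracy-sketch").getOrCreate()

        // Placeholder paths from the config; format assumed to be JSON lines.
        val source = spark.read.json("s3a://bucket_name/path/to/normalized_data")
        val target = spark.read.json("s3a://bucket_name/path/to/raw_data")

        // Source rows with no (id, timestamp, currency) match in target.
        val missing = source.join(target, Seq("id", "timestamp", "currency"), "left_anti")

        val total = source.count()
        val miss  = missing.count()
        println(f"match percentage: ${(total - miss) * 100.0 / total}%.2f%%")
        missing.show(20, truncate = false)  // these are the records to persist
        spark.stop()
      }
    }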
> On Tue, Aug 7, 2018 at 6:57 PM William Guo <[email protected]> wrote:
>
>> + to users mailing list.
>>
>>
>> ---------- Forwarded message ----------
>> From: Randy Myers <[email protected]>
>> Date: Tue, Aug 7, 2018 at 5:46 AM
>> Subject: Re: Hello from griffin team
>> To: William Guo <[email protected]>
>>
>>
>> Hi William,
>>
>> My company tickPredict wants to collect market data from cryptocurrency
>> exchanges (Coinbase, Kraken, Binance, and others). This data is available
>> from the exchanges via a public feed. In most cases the data comes over a
>> websocket, but in a few we have to use REST.
>>
>> We will timestamp this data when the message reaches our AWS cloud
>> instance. We will be saving both the raw data (sent directly to an S3
>> bucket) and "normalized data". The normalized data will aggregate the
>> various feeds into one table structure. We will make this normalized data
>> available to Python/Pandas and R.
>>
>> One of our clients mentioned Griffin as a possible solution for anomaly
>> detection and reporting. The client would like a daily report of missing
>> data or anomalies, which might be values that are outside realistic
>> bounds.
>>
>> Is Griffin something we could use for these purposes?
>>
>> Thanks!
>>
>> Randy Myers
>> Linkedin.com/in/RandyMy
>> 312.933.6476 m
>>
>>
>> On Fri, Aug 3, 2018 at 5:15 PM, William Guo <[email protected]> wrote:
>>
>>> Sure,
>>>
>>> Looking forward to it.
>>>
>>>
>>> Thanks,
>>> William
>>>
>>> On Sat, Aug 4, 2018 at 12:58 AM, Randy Myers <[email protected]> wrote:
>>>
>>>> Thanks for the note, William. I will get you some details of our
>>>> project over the weekend.
>>>>
>>>> Randy Myers
>>>> Linkedin.com/in/RandyMy
>>>> 312.933.6476 m
>>>>
>>>> On Fri, Aug 3, 2018 at 9:22 AM, William Guo <[email protected]> wrote:
>>>>
>>>>> Hello Randy,
>>>>>
>>>>> This is William from the griffin team; Vivian passed your message
>>>>> along to us.
>>>>>
>>>>> We are pleased to support you.
>>>>>
>>>>> Could you tell us your use cases for data quality, so we can follow
>>>>> your requirements better?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> William
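Similarly, a hypothetical sketch of what the two profiling rules in Lionel's example compute, expressed as one plain Spark SQL query (same assumed path and format; the "value" column and the 0..100000 static bounds come from that example config):

    import org.apache.spark.sql.SparkSession

    object ProfilingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("profiling-sketch").getOrCreate()

        spark.read.json("s3a://bucket_name/path/to/normalized_data")
          .createOrReplaceTempView("source")

        // total_count and anomaly_count, mirroring the two griffin-dsl rules;
        // count(CASE ...) counts only the non-null (out-of-bounds) values.
        spark.sql(
          """SELECT count(value) AS total_count,
            |       count(CASE WHEN value < 0 OR value > 100000
            |                  THEN value END) AS anomaly_count
            |FROM source""".stripMargin
        ).show()
        spark.stop()
      }
    }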
