Repository: incubator-griffin-site Updated Branches: refs/heads/asf-site 6eaa9c6f0 -> 5558bdcb3
Updated asf-site site from master (0777296868773f3456019df24829827a90b46fde) Project: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/commit/5558bdcb Tree: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/tree/5558bdcb Diff: http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/diff/5558bdcb Branch: refs/heads/asf-site Commit: 5558bdcb38389a82d8adc73f8f7d0a55a82e3e48 Parents: 6eaa9c6 Author: William Guo <gu...@apache.org> Authored: Tue Sep 18 15:56:32 2018 +0800 Committer: William Guo <gu...@apache.org> Committed: Tue Sep 18 15:56:32 2018 +0800 ---------------------------------------------------------------------- docs/profiling.html | 162 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 161 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-griffin-site/blob/5558bdcb/docs/profiling.html ---------------------------------------------------------------------- diff --git a/docs/profiling.html b/docs/profiling.html index 273452f..1826c21 100644 --- a/docs/profiling.html +++ b/docs/profiling.html @@ -130,7 +130,167 @@ under the License. 
</div> <div class="col-xs-6 col-sm-9 page-main-content" style="margin-left: -15px" id="loadcontent"> <h1 class="page-header" style="margin-top: 0px">Profiling Use Case</h1> - + <h2 id="user-story">User Story</h2> +<p>Say we have a data set (demo_src), partitioned by hour, and we want to know what the data looks like for each hour.</p> + +<p>For simplicity, suppose the data set has the following schema:</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id bigint +age int +desc string +dt string +hour string +</code></pre></div></div> +<p>Both dt and hour are partition columns:</p> + +<p>every day has one daily partition dt (like 20180912),</p> + +<p>and every day has 24 hourly partitions (like 00, 01, 02, …, 23).</p> + +<h2 id="environment-preparation">Environment Preparation</h2> +<p>You need to prepare the environment for the Apache Griffin measure module, including the following software:</p> +<ul> + <li>JDK (1.8+)</li> + <li>Hadoop (2.6.0+)</li> + <li>Spark (2.2.1+)</li> + <li>Hive (2.2.0)</li> +</ul> + +<h2 id="build-griffin-measure-module">Build Griffin Measure Module</h2> +<ol> + <li>Download the Griffin source package <a href="https://www.apache.org/dist/incubator/griffin/0.3.0-incubating">here</a>.</li> + <li>Unzip the source package. + <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>unzip griffin-0.3.0-incubating-source-release.zip +cd griffin-0.3.0-incubating-source-release +</code></pre></div> </div> + </li> + <li>Build Griffin jars. 
+ <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mvn clean install +</code></pre></div> </div> + + <p>Move the built Griffin measure jar to your work path.</p> + + <div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mv measure/target/measure-0.3.0-incubating.jar <work path>/griffin-measure.jar +</code></pre></div> </div> + </li> +</ol> + +<h2 id="data-preparation">Data Preparation</h2> + +<p>For this quick start, we will generate a Hive table demo_src.</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--create hive tables here. hql script +--Note: replace hdfs location with your own path +CREATE EXTERNAL TABLE `demo_src`( + `id` bigint, + `age` int, + `desc` string) +PARTITIONED BY ( + `dt` string, + `hour` string) +ROW FORMAT DELIMITED + FIELDS TERMINATED BY '|' +LOCATION + 'hdfs:///griffin/data/batch/demo_src'; +</code></pre></div></div> +<p>The data could be generated like this:</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1|18|student +2|23|engineer +3|42|cook +... +</code></pre></div></div> +<p>You can download the <a href="/data/batch">demo data</a> and execute <code class="highlighter-rouge">./gen_demo_data.sh</code> to get the data source file. 
+Then we load the data into the Hive table for every hour.</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD DATA LOCAL INPATH 'demo_src' INTO TABLE demo_src PARTITION (dt='20180912',hour='09'); +</code></pre></div></div> +<p>Or you can just execute <code class="highlighter-rouge">./gen-hive-data.sh</code> in the downloaded directory above to generate and load data into the table hourly.</p> + +<h2 id="define-data-quality-measure">Define data quality measure</h2> + +<h4 id="griffin-env-configuration">Griffin env configuration</h4> +<p>The environment config file: env.json</p> +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ + "spark": { + "log.level": "WARN" + }, + "sinks": [ + { + "type": "console" + }, + { + "type": "hdfs", + "config": { + "path": "hdfs:///griffin/persist" + } + }, + { + "type": "elasticsearch", + "config": { + "method": "post", + "api": "http://es:9200/griffin/accuracy" + } + } + ] +} +</code></pre></div></div> + +<h4 id="define-griffin-data-quality">Define Griffin data quality</h4> +<p>The DQ config file: dq.json</p> + +<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{ + "name": "batch_prof", + "process.type": "batch", + "data.sources": [ + { + "name": "src", + "baseline": true, + "connectors": [ + { + "type": "hive", + "version": "1.2", + "config": { + "database": "default", + "table.name": "demo_src" + } + } + ] + } + ], + "evaluate.rule": { + "rules": [ + { + "dsl.type": "griffin-dsl", + "dq.type": "profiling", + "out.dataframe.name": "prof", + "rule": "src.id.count() AS id_count, src.age.max() AS age_max, src.desc.length().max() AS desc_length_max", + "out": [ + { + "type": "metric", + "name": "prof" + } + ] + } + ] + }, + "sinks": ["CONSOLE", "HDFS"] +} +</code></pre></div></div> + +<h2 id="measure-data-quality">Measure data quality</h2> +<p>Submit the measure job to Spark, with the config file paths as parameters.</p> + +<div 
class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>spark-submit --class org.apache.griffin.measure.Application --master yarn --deploy-mode client --queue default \ +--driver-memory 1g --executor-memory 1g --num-executors 2 \ +<path>/griffin-measure.jar \ +<path>/env.json <path>/dq.json +</code></pre></div></div> + +<h2 id="report-data-quality-metrics">Report data quality metrics</h2> +<p>You can watch the calculation log in the console; after the job finishes, the result metrics are printed. The metrics will also be saved in HDFS: <code class="highlighter-rouge">hdfs:///griffin/persist/<job name>/<timestamp>/_METRICS</code>.</p> + +<h2 id="refine-data-quality-report">Refine Data Quality report</h2> +<p>Depending on your business needs, you might need to refine your data quality measures further until you are satisfied.</p> + +<h2 id="more-details">More Details</h2> +<p>For more details about Griffin measures, you can visit our documentation on <a href="https://github.com/apache/incubator-griffin/tree/master/griffin-doc">GitHub</a>.</p> </div><!--end of loadcontent--> </div>
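As a side note on what the profiling rule in dq.json computes: it produces per-column aggregates over the source table. The following plain-Python sketch (not Griffin code; the sample rows are hypothetical and just follow the demo_src schema) illustrates the same three metrics:

```python
# Hypothetical sample rows with the demo_src schema: (id, age, desc).
rows = [
    (1, 18, "student"),
    (2, 23, "engineer"),
    (3, 42, "cook"),
]

# Equivalent of the griffin-dsl rule:
#   src.id.count() AS id_count
#   src.age.max() AS age_max
#   src.desc.length().max() AS desc_length_max
id_count = sum(1 for r in rows if r[0] is not None)
age_max = max(r[1] for r in rows)
desc_length_max = max(len(r[2]) for r in rows)

print(id_count, age_max, desc_length_max)  # prints: 3 42 8
```

In the real job, Spark SQL evaluates these aggregates over the Hive partition data and the resulting metric map is written to the configured sinks.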