[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-31 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/incubator-hivemall/pull/158


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214244436
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,457 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning
+
+
+
+## What is Hivemall?
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+The Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows the 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
gives a helpful overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario in which we would like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to run a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+You can create this table as follows:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation 
of its **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Feature index and feature value are separated by a colon (`:`). When the colon and 
the value are omitted, the value is considered to be `1.0`. So, a categorical feature 
`gender#male` is a [one-hot 
representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science)
 of `index := gender#male` and `value := 1.0`. Note that `#` is not a special 
character for categorical features; it is simply part of the index.
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also more detailed [document for input 
format](../getting_started/input-format.html).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](../getting_started/input-format.html#quantitative-features),
 
[`categorical_features()`](../getting_started/input-format.html#categorical-features)
 and 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214240248
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,457 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning
+
+
+
+## What is Hivemall?
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+You can create this table as follows:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation 
of its **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Feature index and feature value are separated by a colon (`:`). When the colon and 
the value are omitted, the value is considered to be `1.0`. So, a categorical feature 
`gender#male` is a [one-hot 
representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science)
 of `index := gender#male` and `value := 1.0`. Note that `#` is not a special 
character for categorical features; it is simply part of the index.
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also more detailed [document for input 
format](../getting_started/input-format.html).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](../getting_started/input-format.html#quantitative-features),
 
[`categorical_features()`](../getting_started/input-format.html#categorical-features)
 and 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214237067
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,457 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
--- End diff --

removed 2f6e3fa


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214236762
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,457 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
--- End diff --

Remove obvious `with Apache Hivemall`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233539
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation 
of its **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
--- End diff --

Added 0f593c4
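
For reference, a minimal sketch of the two formats quoted above, written with Hive built-ins only (`concat`, `cast`, `array`) and literal values taken from the first row of `purchase_history`. It only illustrates the string layout, under the assumption that numeric values are rendered as doubles. It is not how the tutorial itself builds the vector, and the alias `features` is chosen here purely for illustration:

```sql
-- Illustration only: assemble feature strings by hand with Hive built-ins.
select
  array(
    -- quantitative feature: index and numeric value joined by ':'
    concat('price:', cast(cast(600 as double) as string)),
    -- categorical features: index and category joined by '#'
    concat('day of week#', 'Saturday'),
    concat('gender#', 'male'),
    concat('category#', 'book')
  ) as features
;
```

The query should return one array of four strings in the shape of `price:600.0` and `gender#male`.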


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233514
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
--- End diff --

Added 0f593c4


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233419
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation 
of its **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also more detailed [document for input 
format](../getting_started/input-format.html).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](../getting_started/input-format.html#quantitative-features),
 
[`categorical_features()`](../getting_started/input-format.html#categorical-features)
 and 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214231425
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation 
of its **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also more detailed [document for input 
format](../getting_started/input-format.html).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](../getting_started/input-format.html#quantitative-features),
 
[`categorical_features()`](../getting_started/input-format.html#categorical-features)
 and 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214222772
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
--- End diff --

A general introduction to Apache Hive and HiveQL is not required in 
Hivemall's documentation. The base document was written to introduce Hivemall to TD's 
customers, who might not be aware of the differences between Hive and Presto.

You can start with `Apache Hivemall is a ... lines of query as follows:`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214226384
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+See also the more detailed [document for input format](../getting_started/input-format.html).
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](../getting_started/input-format.html#quantitative-features),
 
[`categorical_features()`](../getting_started/input-format.html#categorical-features)
 and 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223029
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
--- End diff --

Insert something like the following here:

You can create this table as follows:

```sql
create table if not exists purchase_history as ..
```


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214223937
  
--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows 
current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and relevant problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
--- End diff --

It would be better to insert the following sentences after the example.

Feature index and feature value are separated by a colon (`:`). When the colon
and value are omitted, the value is considered to be `1.0`. So, a categorical
feature `gender#male` is a
[one-hot representation](https://www.quora.com/What-is-one-hot-encoding-and-when-is-it-used-in-data-science)
of `index := gender#male` and `value := 1.0`. Note that `#` is not a special
character.
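
For illustration, here is a minimal sketch in plain HiveQL of what such feature strings look like when built by hand. It uses only Hive built-ins (`concat`, `cast`, `array`) rather than the Hivemall helper functions, and it assumes the `purchase_history` table from the tutorial. The string layout is an assumption drawn from the examples in the diff (`price:600.0`, `gender#male`), not a description of how Hivemall builds features internally.

```sql
-- Sketch only: builds feature strings with Hive built-ins, not Hivemall UDFs.
select
  id,
  array(
    -- quantitative feature: index and value joined by ':'
    concat('price:', cast(cast(price as double) as string)),
    -- categorical features: no ':<value>' part, so the value is taken to be 1.0
    concat('day of week#', day_of_week),
    concat('gender#', gender),
    concat('category#', category)
  ) as features,
  label
from
  purchase_history
;
```

Spelling the categorical entries out this way may make it easier to see that `#` is just part of the feature name, while `:` is the only separator that matters.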


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213963105
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213962866
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213962767
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213948036
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213941931
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
--- End diff --

Just add as many links as you can :)

- [Recommendation](http://hivemall.incubator.apache.org/userguide/recommend/cf.html)
- [Anomaly detection](http://hivemall.incubator.apache.org/userguide/anomaly/lof.html)
- [Clustering](http://hivemall.incubator.apache.org/userguide/clustering/lda.html)
- [NLP](http://hivemall.incubator.apache.org/userguide/misc/tokenizer.html)
- etc...


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213940635
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
--- End diff --

I can't find general pages for recommendation, anomaly detection, or 
clustering. Descriptions for NLP, data sketching (found only in the function 
list), and evaluation are also missing.

Do I need to add whole pages, or is it enough to link as much as I can?


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-30 Thread chezou
Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213918651
  
--- Diff: docs/gitbook/SUMMARY.md ---
@@ -25,6 +25,7 @@
 * [Installation](getting_started/installation.md)
 * [Install as permanent 
functions](getting_started/permanent-functions.md)
 * [Input Format](getting_started/input-format.md)
+* [Step-by-Step Tutorial on Supervised 
Learning](getting_started/tutorial.md)
--- End diff --

I will move this under `Supervised Learning`.


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213892214
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
--- End diff --

For every element in the list, add a link to the corresponding documentation 
page, like: 
[Binary Classification](http://hivemall.incubator.apache.org/userguide/binaryclass/general.html)


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213896045
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213897864
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213898251
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213898101
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213892353
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
--- End diff --

Add a link to 
[Input Format](http://hivemall.incubator.apache.org/userguide/getting_started/input-format.html)
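
For anyone reading this thread without the Input Format page open, the convention described in the quoted paragraph (a name joined to its value with `:` for quantitative features and `#` for categorical ones) can be sketched in plain Python. This is only an illustration of the string layout, not Hivemall code: the helper names `quantitative_feature`, `categorical_feature`, and `to_feature_vector` are hypothetical, and the sketch uses the column name `day_of_week` as the feature name rather than the `day of week` label shown in the diff.

```python
def quantitative_feature(name, value):
    """Format a numeric feature as '<name>:<value>' (hypothetical helper)."""
    return f"{name}:{float(value)}"


def categorical_feature(name, value):
    """Format a categorical feature as '<name>#<value>' (hypothetical helper)."""
    return f"{name}#{value}"


def to_feature_vector(record, quantitative, categorical):
    """Build a list of feature strings from one record (a dict)."""
    features = [quantitative_feature(k, record[k]) for k in quantitative]
    features += [categorical_feature(k, record[k]) for k in categorical]
    return features


# First row of the mock purchase_history table from the diff.
record = {
    "day_of_week": "Saturday",
    "gender": "male",
    "price": 600,
    "category": "book",
}

vector = to_feature_vector(
    record,
    quantitative=["price"],
    categorical=["day_of_week", "gender", "category"],
)
print(vector)
# ['price:600.0', 'day_of_week#Saturday', 'gender#male', 'category#book']
```

In the tutorial itself this conversion is done inside the query by the Hivemall functions the diff goes on to name, so the sketch is useful only for checking what the resulting strings should look like.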


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213898139
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213891358
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
--- End diff --

Missing tail semicolon `;`.
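
Presumably the statement would then be terminated like the other queries in the tutorial. A minimal sketch, assuming the Hivemall functions are already registered in the session:

```sql
-- Same query as in the quoted diff, with the trailing semicolon added.
select hivemall_version();
```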


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213895853
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213892034
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
--- End diff --

`s/and TD//`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213891598
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
--- End diff --

Use relative paths for internal links, for example:

```
On the TD console, Hivemall function 
[`hivemall_version()`](../misc/funcs.html#others) shows ...
```

Same for the other internal links.


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213891146
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
--- End diff --

`s/log/history/` and missing tail semicolon: 

```sql
select count(1) from purchase_history;
```
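
A related point, offered as a hedged sketch rather than as part of the review: the quoted `create table` statement declares no `gender` column, while the quoted `insert` statement selects six values per row, including `gender`. Assuming the table is meant to match the sample rows (an assumption, since the final schema does not appear in this thread), the definition and the corrected count query might look like:

```sql
-- Assumed schema: adds the `gender` column that the quoted insert statement populates.
create table if not exists purchase_history
(id bigint, day_of_week string, gender string, price int, category string, label int)
;

-- Count rows using the table name that the rest of the tutorial uses.
select count(1) from purchase_history;
```

With the five sample rows from the quoted `insert` statement loaded, the count should be 5, which matches the output shown in the diff.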


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890176
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
+
+- Binary and multi-class classification
+- Regression
+- Recommendation
+- Anomaly detection
+- Natural language processing
+- Clustering (i.e., topic modeling)
+- Data sketching
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
would be helpful to understand more about an overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of simple regressor and binary classifier.
+
+## Binary classification
+
+Imagine a scenario that we like to build a binary classifier from the mock 
`purchase_history` data and predict unforeseen purchases to conduct a new 
campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000  | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use Hivemall 
[`train_classifier()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#binary-classification)
 UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
+
+To be more precise, Hivemall represents single feature in a concatenation 
of **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
+Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
+
+```
+["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
+```
+
+Therefore, what we first need to do is to convert the records into an 
array of feature strings, and Hivemall functions 
[`quantitative_features()`](https://hivemall.incubator.apache.org/userguide/getting_started/input-format.html#quantitative-features),
 

[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890053
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
+
+```sql
+select hivemall_version()
+```
+
+> "0.5.1-20180703-SNAPSHOT-31924dc" (as of July 23, 2018)
+
+Below we list ML and relevant problems that Hivemall and TD can solve:
--- End diff --

remove `TD`


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890384
  
--- Diff: docs/gitbook/SUMMARY.md ---
@@ -25,6 +25,7 @@
 * [Installation](getting_started/installation.md)
 * [Install as permanent 
functions](getting_started/permanent-functions.md)
 * [Input Format](getting_started/input-format.md)
+* [Step-by-Step Tutorial on Supervised 
Learning](getting_started/tutorial.md)
--- End diff --

This would be better moved to the `Supervised Learning` or `Regression` 
section, or else renamed.


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r213890012
  
--- Diff: docs/gitbook/getting_started/tutorial.md ---
@@ -0,0 +1,493 @@
+
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
+
+```sql
+create table if not exists purchase_history
+(id bigint, day_of_week string, price int, category string, label int)
+;
+```
+
+
+```sql
+insert overwrite table purchase_history
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_log
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
+
+```sql
+SELECT
+  train_classifier(
+features,
+label,
+'-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+On the TD console, Hivemall function 
[`hivemall_version()`](http://hivemall.incubator.apache.org/userguide/misc/funcs.html#others)
 shows current Hivemall version that is available on TD, for example:
--- End diff --

`TD console` should not appear here.


---


[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

2018-08-29 Thread chezou
GitHub user chezou opened a pull request:

https://github.com/apache/incubator-hivemall/pull/158

[HIVEMALL-215] Add step-by-step tutorial on Supervised Learning

## What changes were proposed in this pull request?

This PR introduces a step-by-step tutorial on supervised learning.

## What type of PR is it?

Documentation

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-215


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/chezou/incubator-hivemall tutorial

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #158






---