Repository: incubator-hivemall Updated Branches: refs/heads/master 2379abb72 -> 2d80faefe
[HIVEMALL-189] Create a list of all functions ## What changes were proposed in this pull request? Create a list of all functions in the documentation. In order to make maintenance easier and simpler, the list is systematically generated by reading `Description` annotation in the code: [takuti/hivemalldoc](https://github.com/takuti/hivemalldoc). In case this list does not look sufficient, let's update `Description` annotation itself and make the code more informative in the future. ## What type of PR is it? Documentation ## What is the Jira issue? https://issues.apache.org/jira/browse/HIVEMALL-189 Author: Takuya Kitazawa <k.tak...@gmail.com> Closes #143 from takuti/HIVEMALL-189. Project: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/repo Commit: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/commit/2d80faef Tree: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/tree/2d80faef Diff: http://git-wip-us.apache.org/repos/asf/incubator-hivemall/diff/2d80faef Branch: refs/heads/master Commit: 2d80faefe731447e88bf5b8e65c1c425d28d0c57 Parents: 2379abb Author: Takuya Kitazawa <k.tak...@gmail.com> Authored: Mon Apr 16 22:46:36 2018 +0900 Committer: Makoto Yui <m...@apache.org> Committed: Mon Apr 16 22:46:36 2018 +0900 ---------------------------------------------------------------------- docs/gitbook/SUMMARY.md | 89 ++++--- docs/gitbook/misc/funcs.md | 459 ++++++++++++++++++++++++++++++++ docs/gitbook/misc/generic_funcs.md | 42 ++- 3 files changed, 536 insertions(+), 54 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/2d80faef/docs/gitbook/SUMMARY.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/SUMMARY.md b/docs/gitbook/SUMMARY.md index 0d30ba0..1166218 100644 --- a/docs/gitbook/SUMMARY.md +++ b/docs/gitbook/SUMMARY.md @@ -26,29 +26,31 @@ * [Install as permanent 
functions](getting_started/permanent-functions.md) * [Input Format](getting_started/input-format.md) +* [List of Functions](misc/funcs.md) + * [Tips for Effective Hivemall](tips/README.md) * [Explicit add_bias() for better prediction](tips/addbias.md) * [Use rand_amplify() to better prediction results](tips/rand_amplify.md) - * [Real-time Prediction on RDBMS](tips/rt_prediction.md) + * [Real-time prediction on RDBMS](tips/rt_prediction.md) * [Ensemble learning for stable prediction](tips/ensemble_learning.md) * [Mixing models for a better prediction convergence (MIX server)](tips/mixserver.md) * [Run Hivemall on Amazon Elastic MapReduce](tips/emr.md) -* [General Hive/Hadoop tips](tips/general_tips.md) +* [General Hive/Hadoop Tips](tips/general_tips.md) * [Adding rowid for each row](tips/rowid.md) * [Hadoop tuning for Hivemall](tips/hadoop_tuning.md) * [Troubleshooting](troubleshooting/README.md) * [OutOfMemoryError in training](troubleshooting/oom.md) - * [SemanticException Generate Map Join Task Error: Cannot serialize object](troubleshooting/mapjoin_task_error.md) + * [SemanticException generate map join task error: Cannot serialize object](troubleshooting/mapjoin_task_error.md) * [Asterisk argument for UDTF does not work](troubleshooting/asterisk.md) * [The number of mappers is less than input splits in Hadoop 2.x](troubleshooting/num_mappers.md) - * [Map-side Join causes ClassCastException on Tez](troubleshooting/mapjoin_classcastex.md) + * [Map-side join causes ClassCastException on Tez](troubleshooting/mapjoin_classcastex.md) ## Part II - Generic Features -* [List of generic Hivemall functions](misc/generic_funcs.md) -* [Efficient Top-K query processing](misc/topk.md) +* [List of Generic Hivemall Functions](misc/generic_funcs.md) +* [Efficient Top-K Query Processing](misc/topk.md) * [Text Tokenizer](misc/tokenizer.md) * [Approximate Aggregate Functions](misc/approx.md) @@ -58,62 +60,61 @@ * [Feature Hashing](ft_engineering/hashing.md) * [Feature 
Selection](ft_engineering/selection.md) * [Feature Binning](ft_engineering/binning.md) -* [FEATURE PAIRING](ft_engineering/pairing.md) - * [Polynomial Features](ft_engineering/polynomial.md) -* [FEATURE TRANSFORMATION](ft_engineering/ft_trans.md) - * [Feature Vectorization](ft_engineering/vectorization.md) +* [Feature Pairing](ft_engineering/pairing.md) + * [Polynomial features](ft_engineering/polynomial.md) +* [Feature Transformation](ft_engineering/ft_trans.md) + * [Feature vectorization](ft_engineering/vectorization.md) * [Quantify non-number features](ft_engineering/quantify.md) * [TF-IDF Calculation](ft_engineering/tfidf.md) ## Part IV - Evaluation * [Binary Classification Metrics](eval/binary_classification_measures.md) - * [Area Under the ROC Curve](eval/auc.md) + * [Area under the ROC curve](eval/auc.md) * [Multi-label Classification Metrics](eval/multilabel_classification_measures.md) -* [Regression metrics](eval/regression.md) +* [Regression Metrics](eval/regression.md) * [Ranking Measures](eval/rank.md) - * [Data Generation](eval/datagen.md) * [Logistic Regression data generation](eval/lr_datagen.md) - + ## Part V - Supervised Learning * [How Prediction Works](misc/prediction.md) - -## Part VI - Binary classification + +## Part VI - Binary Classification * [Binary Classification](binaryclass/general.md) -* [a9a tutorial](binaryclass/a9a.md) +* [a9a Tutorial](binaryclass/a9a.md) * [Data preparation](binaryclass/a9a_dataset.md) * [Logistic Regression](binaryclass/a9a_lr.md) - * [Mini-batch Gradient Descent](binaryclass/a9a_minibatch.md) + * [Mini-batch gradient descent](binaryclass/a9a_minibatch.md) -* [News20 tutorial](binaryclass/news20.md) +* [News20 Tutorial](binaryclass/news20.md) * [Data preparation](binaryclass/news20_dataset.md) * [Perceptron, Passive Aggressive](binaryclass/news20_pa.md) * [CW, AROW, SCW](binaryclass/news20_scw.md) * [AdaGradRDA, AdaGrad, AdaDelta](binaryclass/news20_adagrad.md) * [Random Forest](binaryclass/news20_rf.md) -* 
[KDD2010a tutorial](binaryclass/kdd2010a.md) +* [KDD2010a Tutorial](binaryclass/kdd2010a.md) * [Data preparation](binaryclass/kdd2010a_dataset.md) * [PA, CW, AROW, SCW](binaryclass/kdd2010a_scw.md) -* [KDD2010b tutorial](binaryclass/kdd2010b.md) +* [KDD2010b Tutorial](binaryclass/kdd2010b.md) * [Data preparation](binaryclass/kdd2010b_dataset.md) * [AROW](binaryclass/kdd2010b_arow.md) -* [Webspam tutorial](binaryclass/webspam.md) +* [Webspam Tutorial](binaryclass/webspam.md) * [Data preparation](binaryclass/webspam_dataset.md) * [PA1, AROW, SCW](binaryclass/webspam_scw.md) -* [Kaggle Titanic tutorial](binaryclass/titanic_rf.md) +* [Kaggle Titanic Tutorial](binaryclass/titanic_rf.md) -## Part VII - Multiclass classification +## Part VII - Multiclass Classification -* [News20 Multiclass tutorial](multiclass/news20.md) +* [News20 Multiclass Tutorial](multiclass/news20.md) * [Data preparation](multiclass/news20_dataset.md) * [Data preparation for one-vs-the-rest classifiers](multiclass/news20_one-vs-the-rest_dataset.md) * [PA](multiclass/news20_pa.md) @@ -121,7 +122,7 @@ * [Ensemble learning](multiclass/news20_ensemble.md) * [one-vs-the-rest classifier](multiclass/news20_one-vs-the-rest.md) -* [Iris tutorial](multiclass/iris.md) +* [Iris Tutorial](multiclass/iris.md) * [Data preparation](multiclass/iris_dataset.md) * [SCW](multiclass/iris_scw.md) * [Random Forest](multiclass/iris_randomforest.md) @@ -130,34 +131,34 @@ * [Regression](regression/general.md) -* [E2006-tfidf regression tutorial](regression/e2006.md) +* [E2006-tfidf Regression Tutorial](regression/e2006.md) * [Data preparation](regression/e2006_dataset.md) * [Passive Aggressive, AROW](regression/e2006_arow.md) -* [KDDCup 2012 track 2 CTR prediction tutorial](regression/kddcup12tr2.md) +* [KDDCup 2012 Track 2 CTR Prediction Tutorial](regression/kddcup12tr2.md) * [Data preparation](regression/kddcup12tr2_dataset.md) * [Logistic Regression, Passive Aggressive](regression/kddcup12tr2_lr.md) - * [Logistic 
Regression with Amplifier](regression/kddcup12tr2_lr_amplify.md) + * [Logistic Regression with amplifier](regression/kddcup12tr2_lr_amplify.md) * [AdaGrad, AdaDelta](regression/kddcup12tr2_adagrad.md) ## Part IX - Recommendation * [Collaborative Filtering](recommend/cf.md) - * [Item-based Collaborative Filtering](recommend/item_based_cf.md) + * [Item-based collaborative filtering](recommend/item_based_cf.md) -* [News20 related article recommendation Tutorial](recommend/news20.md) +* [News20 Related Article Recommendation Tutorial](recommend/news20.md) * [Data preparation](multiclass/news20_dataset.md) - * [LSH/Minhash and Jaccard Similarity](recommend/news20_jaccard.md) - * [LSH/Minhash and Brute-Force Search](recommend/news20_knn.md) - * [kNN search using b-Bits Minhash](recommend/news20_bbit_minhash.md) + * [LSH/MinHash and Jaccard similarity](recommend/news20_jaccard.md) + * [LSH/MinHash and brute-force search](recommend/news20_knn.md) + * [kNN search using b-Bits MinHash](recommend/news20_bbit_minhash.md) -* [MovieLens movie recommendation Tutorial](recommend/movielens.md) +* [MovieLens Movie Recommendation Tutorial](recommend/movielens.md) * [Data preparation](recommend/movielens_dataset.md) - * [Item-based Collaborative Filtering](recommend/movielens_cf.md) + * [Item-based collaborative filtering](recommend/movielens_cf.md) * [Matrix Factorization](recommend/movielens_mf.md) * [Factorization Machine](recommend/movielens_fm.md) - * [SLIM for Fast Top-K Recommendation](recommend/movielens_slim.md) - * [10-fold Cross Validation (Matrix Factorization)](recommend/movielens_cv.md) + * [SLIM for fast top-k recommendation](recommend/movielens_slim.md) + * [10-fold cross validation (Matrix Factorization)](recommend/movielens_cv.md) ## Part X - Anomaly Detection @@ -170,7 +171,7 @@ * [Latent Dirichlet Allocation](clustering/lda.md) * [Probabilistic Latent Semantic Analysis](clustering/plsa.md) -## Part XII - GeoSpatial functions +## Part XII - GeoSpatial Functions * 
[Lat/Lon functions](geospatial/latlon.md) @@ -180,15 +181,15 @@ * [Installation](spark/getting_started/installation.md) * [Binary Classification](spark/binaryclass/index.md) - * [a9a Tutorial for DataFrame](spark/binaryclass/a9a_df.md) - * [a9a Tutorial for SQL](spark/binaryclass/a9a_sql.md) + * [a9a tutorial for DataFrame](spark/binaryclass/a9a_df.md) + * [a9a tutorial for SQL](spark/binaryclass/a9a_sql.md) * [Regression](spark/binaryclass/index.md) - * [E2006-tfidf regression Tutorial for DataFrame](spark/regression/e2006_df.md) - * [E2006-tfidf regression Tutorial for SQL](spark/regression/e2006_sql.md) + * [E2006-tfidf regression tutorial for DataFrame](spark/regression/e2006_df.md) + * [E2006-tfidf regression tutorial for SQL](spark/regression/e2006_sql.md) * [Generic features](spark/misc/misc.md) - * [Top-k Join processing](spark/misc/topk_join.md) + * [Top-k join processing](spark/misc/topk_join.md) * [Other utility functions](spark/misc/functions.md) ## Part XIV - Hivemall on Docker http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/2d80faef/docs/gitbook/misc/funcs.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/funcs.md b/docs/gitbook/misc/funcs.md new file mode 100644 index 0000000..d3b1565 --- /dev/null +++ b/docs/gitbook/misc/funcs.md @@ -0,0 +1,459 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. 
You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. +--> + +This page describes a list of Hivemall functions. See also a [list of generic Hivemall functions](./generic_funcs.md) for more general-purpose functions such as array and map UDFs. + +<!-- toc --> + +# Regression + +- `train_arow_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight, float covar> + +- `train_arowe2_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight, float covar> + +- `train_arowe_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight, float covar> + +- `train_pa1_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight> + +- `train_pa1a_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight> + +- `train_pa2_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight> + +- `train_pa2a_regr(array<int|bigint|string> features, float target [, constant string options])` - Returns a relation consists of <{int|bigint|string} feature, float weight> + +- `train_regressor(list<string|int|bigint> features, double label [, const string 
options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a prediction model by a generic regressor + ``` + +# Classification + +## Binary classification + +- `kpa_predict(@Nonnull double xh, @Nonnull double xk, @Nullable float w0, @Nonnull float w1, @Nonnull float w2, @Nullable float w3)` - Returns a prediction value in Double + +- `train_arow(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) binary classifier + ``` + +- `train_arowh(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by AROW binary classifier using hinge loss + ``` + +- `train_classifier(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a prediction model by a generic classifier + ``` + +- `train_cw(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by Confidence-Weighted (CW) binary classifier + ``` + +- `train_kpa(array<string|int|bigint> features, int label [, const string options])` - returns a relation <h int, hk int, float w0, float w1, float w2, float w3> + +- `train_pa(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a prediction model by Passive-Aggressive (PA) binary classifier + ``` + +- `train_pa1(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a 
prediction model by Passive-Aggressive 1 (PA-1) binary classifier + ``` + +- `train_pa2(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a prediction model by Passive-Aggressive 2 (PA-2) binary classifier + ``` + +- `train_perceptron(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight> + ``` + Build a prediction model by Perceptron binary classifier + ``` + +- `train_scw(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by Soft Confidence-Weighted (SCW-1) binary classifier + ``` + +- `train_scw2(list<string|int|bigint> features, int label [, const string options])` - Returns a relation consists of <string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by Soft Confidence-Weighted 2 (SCW-2) binary classifier + ``` + +## Multiclass classification + +- `train_multiclass_arow(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight, float covar> + ``` + Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) multiclass classifier + ``` + +- `train_multiclass_arowh(list<string|int|bigint> features, int|string label [, const string options])` - Returns a relation consists of <int|string label, string|int|bigint feature, float weight, float covar> + ``` + Build a prediction model by Adaptive Regularization of Weight Vectors (AROW) multiclass classifier using hinge loss + ``` + +- `train_multiclass_cw(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, 
float weight, float covar> + ``` + Build a prediction model by Confidence-Weighted (CW) multiclass classifier + ``` + +- `train_multiclass_pa(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight> + ``` + Build a prediction model by Passive-Aggressive (PA) multiclass classifier + ``` + +- `train_multiclass_pa1(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight> + ``` + Build a prediction model by Passive-Aggressive 1 (PA-1) multiclass classifier + ``` + +- `train_multiclass_pa2(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight> + ``` + Build a prediction model by Passive-Aggressive 2 (PA-2) multiclass classifier + ``` + +- `train_multiclass_perceptron(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight> + ``` + Build a prediction model by Perceptron multiclass classifier + ``` + +- `train_multiclass_scw(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight, float covar> + ``` + Build a prediction model by Soft Confidence-Weighted (SCW-1) multiclass classifier + ``` + +- `train_multiclass_scw2(list<string|int|bigint> features, {int|string} label [, const string options])` - Returns a relation consists of <{int|string} label, {string|int|bigint} feature, float weight, float covar> + ``` + Build a prediction model by Soft Confidence-Weighted 2 (SCW-2) multiclass classifier + ``` + +# Matrix factorization + +- `bprmf_predict(List<Float> Pu, List<Float> Qi[, 
double Bi])` - Returns the prediction value + +- `mf_predict(List<Float> Pu, List<Float> Qi[, double Bu, double Bi[, double mu]])` - Returns the prediction value + +- `train_bprmf(INT user, INT posItem, INT negItem [, String options])` - Returns a relation <INT i, FLOAT Pi, FLOAT Qi [, FLOAT Bi]> + +- `train_mf_adagrad(INT user, INT item, FLOAT rating [, CONSTANT STRING options])` - Returns a relation consists of <int idx, array<float> Pu, array<float> Qi [, float Bu, float Bi [, float mu]]> + +- `train_mf_sgd(INT user, INT item, FLOAT rating [, CONSTANT STRING options])` - Returns a relation consists of <int idx, array<float> Pu, array<float> Qi [, float Bu, float Bi [, float mu]]> + +# Factorization machines + +- `ffm_predict(float Wi, array<float> Vifj, array<float> Vjfi, float Xi, float Xj)` - Returns a prediction value in Double + +- `fm_predict(Float Wj, array<float> Vjf, float Xj)` - Returns a prediction value in Double + +- `train_ffm(array<string> x, double y [, const string options])` - Returns a prediction model + +- `train_fm(array<string> x, double y [, const string options])` - Returns a prediction model + +# Recommendation + +- `train_slim(int i, map<int, double> r_i, map<int, map<int, double>> topKRatesOfI, int j, map<int, double> r_j [, constant string options])` - Returns row index, column index and non-zero weight value of prediction model + +# Anomaly detection + +- `changefinder(double|array<double> x [, const string options])` - Returns outlier/change-point scores and decisions using ChangeFinder. It will return a tuple <double outlier_score, double changepoint_score [, boolean is_anomaly [, boolean is_changepoint]]> + +- `sst(double|array<double> x [, const string options])` - Returns change-point scores and decisions using Singular Spectrum Transformation (SST). 
It will return a tuple <double changepoint_score [, boolean is_changepoint]> + +# Topic modeling + +- `lda_predict(string word, float value, int label, float lambda[, const string options])` - Returns a list which consists of <int label, float prob> + +- `plsa_predict(string word, float value, int label, float prob[, const string options])` - Returns a list which consists of <int label, float prob> + +- `train_lda(array<string> words[, const string options])` - Returns a relation consists of <int topic, string word, float score> + +- `train_plsa(array<string> words[, const string options])` - Returns a relation consists of <int topic, string word, float score> + +# Preprocessing + +## Feature creation + +- `add_bias(feature_vector in array<string>)` - Returns features with a bias in array<string> + +- `add_feature_index(ARRAY[DOUBLE]: dense feature vector)` - Returns a feature vector with feature indices + +- `extract_feature(feature_vector in array<string>)` - Returns features in array<string> + +- `extract_weight(feature_vector in array<string>)` - Returns the weights of features in array<string> + +- `feature(<string|int|long|short|byte> feature, <number> value)` - Returns a feature string + +- `feature_index(feature_vector in array<string>)` - Returns feature indices in array<index> + +- `sort_by_feature(map in map<int,float>)` - Returns a sorted map + +## Data amplification + +- `amplify(const int xtimes, *)` - Amplifies the input records x-times + +- `rand_amplify(const int xtimes [, const string options], *)` - Amplifies the input records x-times on the map side + +## Feature binning + +- `build_bins(number weight, const int num_of_bins[, const boolean auto_shrink = false])` - Return quantiles representing bins: array<double> + +- `feature_binning(array<features::string> features, const map<string, array<number>> quantiles_map)` / `feature_binning(number weight, const array<number> quantiles)` - Returns binned features as an array<features::string> / bin ID as int + +## Feature 
format conversion + +- `conv2dense(int feature, float weight, int nDims)` - Return a dense model in array<float> + +- `quantify(boolean output, col1, col2, ...)` - Returns identified features + +- `to_dense_features(array<string> feature_vector, int dimensions)` - Returns a dense feature in array<float> + +- `to_sparse_features(array<float> feature_vector)` - Returns a sparse feature in array<string> + +## Feature hashing + +- `array_hash_values(array<string> values, [string prefix [, int numFeatures], boolean useIndexAsPrefix])` returns hash values in array<int> + +- `feature_hashing(array<string> features [, const string options])` - returns a hashed feature vector in array<string> + +- `mhash(string word)` returns a murmurhash3 INT value starting from 1 + +- `prefixed_hash_values(array<string> values, string prefix [, boolean useIndexAsPrefix])` returns array<string> in which each element has the specified prefix + +- `sha1(string word [, int numFeatures])` returns a SHA-1 value + +## Feature pairing + +- `feature_pairs(feature_vector in array<string> [, const string options])` - Returns a relation <string i, string j, double xi, double xj> + +- `polynomial_features(feature_vector in array<string>)` - Returns a feature vector having a polynomial feature space + +- `powered_features(feature_vector in array<string>, int degree [, boolean truncate])` - Returns a feature vector having a powered feature space + +## Ranking + +- `bpr_sampling(int userId, List<int> posItems [, const string options])` - Returns a relation consists of <int userId, int itemId> + +- `item_pairs_sampling(array<int|long> pos_items, const int max_item_id [, const string options])` - Returns a relation consists of <int pos_item_id, int neg_item_id> + +- `populate_not_in(list items, const int max_item_id [, const string options])` - Returns a relation consists of <int item> that item does not exist in the given items + +## Feature scaling + +- `l1_normalize(ftvec string)` - Returns an L1-normalized 
value + +- `l2_normalize(ftvec string)` - Returns an L2-normalized value + +- `rescale(value, min, max)` - Returns a rescaled value by min-max normalization + +- `zscore(value, mean, stddev)` - Returns a standard score (zscore) + +## Feature selection + +- `chi2(array<array<number>> observed, array<array<number>> expected)` - Returns chi2_val and p_val of each column as <array<double>, array<double>> + +- `snr(array<number> features, array<int> one-hot class label)` - Returns Signal Noise Ratio for each feature as array<double> + +## Feature transformation and vectorization + +- `add_field_indices(array<string> features)` - Returns an array of string in which field indices (<field>:<feature>)* are augmented + +- `binarize_label(int/long positive, int/long negative, ...)` - Returns positive/negative records that are represented as (..., int label) where label is 0 or 1 + +- `categorical_features(array<string> featureNames, feature1, feature2, .. [, const string options])` - Returns a feature vector array<string> + +- `ffm_features(const array<string> featureNames, feature1, feature2, .. [, const string options])` - Takes categorical variables and returns a feature vector array<string> in the libffm format <field>:<index>:<value> + +- `indexed_features(double v1, double v2, ...)` - Returns a list of features as array<string>: [1:v1, 2:v2, ..] + +- `onehot_encoding(PRIMITIVE feature, ...)` - Computes a one-hot encoded label for each feature + +- `quantified_features(boolean output, col1, col2, ...)` - Returns identified features in a dense array<double> + +- `quantitative_features(array<string> featureNames, feature1, feature2, .. [, const string options])` - Returns a feature vector array<string> + +- `vectorize_features(array<string> featureNames, feature1, feature2, .. 
[, const string options])` - Returns a feature vector array<string> + +# Geospatial functions + +- `haversine_distance(double lat1, double lon1, double lat2, double lon2, [const boolean mile=false])`::double - Returns the distance between two locations in km [or miles] using the `haversine` formula + ``` + Usage: select latlon_distance(lat1, lon1, lat2, lon2) from ... + ``` + +- `lat2tiley(double lat, int zoom)`::int - Returns the tile number of the given latitude and zoom level + +- `lon2tilex(double lon, int zoom)`::int - Returns the tile number of the given longitude and zoom level + +- `map_url(double lat, double lon, int zoom [, const string option])` - Returns a URL string + ``` + OpenStreetMap: http://tile.openstreetmap.org/${zoom}/${xtile}/${ytile}.png + Google Maps: https://www.google.com/maps/@${lat},${lon},${zoom}z + ``` + +- `tile(double lat, double lon, int zoom)`::bigint - Returns a tile number 2^2n where n is the zoom level. + + ``` + tile(lat,lon,zoom) = xtile(lon,zoom) + ytile(lat,zoom) * 2^zoom + Refer to http://wiki.openstreetmap.org/wiki/Slippy_map_tilenames for details + ``` + +- `tilex2lon(int x, int zoom)`::double - Returns the longitude of the given tile x and zoom level + +- `tiley2lat(int y, int zoom)`::double - Returns the latitude of the given tile y and zoom level + +# Distance measures + +- `angular_distance(ftvec1, ftvec2)` - Returns the angular distance of the given two vectors + +- `cosine_distance(ftvec1, ftvec2)` - Returns the cosine distance of the given two vectors + +- `euclid_distance(ftvec1, ftvec2)` - Returns the square root of the sum of the squared differences: sqrt(sum((x - y)^2)) + +- `hamming_distance(A, B [,int k])` - Returns the Hamming distance between A and B + +- `jaccard_distance(A, B [,int k])` - Returns the Jaccard distance between A and B + +- `kld(double mu1, double sigma1, double mu2, double sigma2)` - Returns the KL divergence between two distributions + +- `manhattan_distance(list x, list y)` - Returns sum(|x - y|) + +- `minkowski_distance(list x, 
list y, double p)` - Returns sum(|x - y|^p)^(1/p) + +- `popcnt(a [, b])` - Returns a popcount value + +# Locality-sensitive hashing + +- `bbit_minhash(array<> features [, int numHashes])` - Returns a b-bits minhash value + +- `minhash(ANY item, array<int|bigint|string> features [, constant string options])` - Returns n different k-depth signatures (i.e., cluster ids) for each item <cluster id, item> + +- `minhashes(array<> features [, int numHashes, int keyGroup [, boolean noWeight]])` - Returns minhash values + +# Similarity measures + +- `angular_similarity(ftvec1, ftvec2)` - Returns the angular similarity of the given two vectors + +- `cosine_similarity(ftvec1, ftvec2)` - Returns the cosine similarity of the given two vectors + +- `dimsum_mapper(array<string> row, map<int col_id, double norm> colNorms [, const string options])` - Returns column-wise partial similarities + +- `distance2similarity(float d)` - Returns 1.0 / (1.0 + d) + +- `euclid_similarity(ftvec1, ftvec2)` - Returns a Euclid distance-based similarity, which is `1.0 / (1.0 + distance)`, of the given two vectors + +- `jaccard_similarity(A, B [,int k])` - Returns the Jaccard similarity coefficient of A and B + +# Evaluation + +- `auc(array rankItems | double score, array correctItems | int label [, const int recommendSize = rankItems.size ])` - Returns AUC + +- `average_precision(array rankItems, array correctItems [, const int recommendSize = rankItems.size])` - Returns MAP + +- `f1score(array[int], array[int])` - Returns an F1 score + +- `fmeasure(array|int|boolean actual, array|int|boolean predicted [, const string options])` - Returns an F-measure (f1score is the special case with beta=1.0) + +- `hitrate(array rankItems, array correctItems [, const int recommendSize = rankItems.size])` - Returns HitRate + +- `logloss(double predicted, double actual)` - Returns the Logarithmic Loss + +- `mae(double predicted, double actual)` - Returns the Mean Absolute Error + +- `mrr(array rankItems, array correctItems [, const int 
recommendSize = rankItems.size])` - Returns MRR + +- `mse(double predicted, double actual)` - Returns the Mean Squared Error + +- `ndcg(array rankItems, array correctItems [, const int recommendSize = rankItems.size])` - Returns nDCG + +- `precision_at(array rankItems, array correctItems [, const int recommendSize = rankItems.size])` - Returns Precision + +- `r2(double predicted, double actual)` - Returns R Squared (coefficient of determination) + +- `recall_at(array rankItems, array correctItems [, const int recommendSize = rankItems.size])` - Returns Recall + +- `rmse(double predicted, double actual)` - Returns the Root Mean Squared Error + +# Sketching + +- `approx_count_distinct(expr x [, const string options])` - Returns an approximation of count(DISTINCT x) using the HyperLogLogPlus algorithm + +# Ensemble learning + +## Utils + +- `argmin_kld(float mean, float covar)` - Returns the mean or covar that minimizes the KL-distance among distributions + ``` + The returned value is (1.0 / (sum(1.0 / covar))) * (sum(mean / covar)) + ``` + +- `max_label(double value, string label)` - Returns the label that has the maximum value + +- `maxrow(ANY compare, ...)` - Returns the row that has the maximum value in the 1st argument + +## Bagging + +- `voted_avg(double value)` - Returns an averaged value by bagging for classification + +- `weight_voted_avg(expr)` - Returns an averaged value by considering the sum of positive/negative weights + +# Decision trees and Random Forest + +- `train_gradient_tree_boosting_classifier(array<double|string> features, int label [, string options])` - Returns a relation consists of <int iteration, int model_type, array<string> pred_models, double intercept, double shrinkage, array<double> var_importance, float oob_error_rate> + +- `train_randomforest_classifier(array<double|string> features, int label [, const array<double> classWeights, const string options])` - Returns a relation consists of <int model_id, int model_type, string pred_model, array<double> var_importance, 
int oob_errors, int oob_tests, double weight> + +- `train_randomforest_regression(array<double|string> features, double target [, string options])` - Returns a relation consists of <int model_id, int model_type, string pred_model, array<double> var_importance, int oob_errors, int oob_tests> + +- `guess_attribute_types(ANY, ...)` - Returns attribute types + ``` + select guess_attribute_types(*) from train limit 1; + > Q,Q,C,C,C,C,Q,C,C,C,Q,C,Q,Q,Q,Q,C,Q + ``` + +- `rf_ensemble(int yhat [, array<double> proba [, double model_weight=1.0]])` - Returns emsebled prediction results in <int label, double probability, array<double> probabilities> + +- `tree_export(string model, const string options, optional array<string> featureNames=null, optional array<string> classNames=null)` - exports a Decision Tree model as javascript/dot] + +- `tree_predict(string modelId, string model, array<double|string> features [, const string options | const boolean classification=false])` - Returns a prediction result of a random forest in <int value, array<double> posteriori> for classification and <double> for regression + +# XGBoost + +- `train_multiclass_xgboost_classifier(string[] features, double target [, string options])` - Returns a relation consisting of <string model_id, array<byte> pred_model> + +- `train_xgboost_classifier(string[] features, double target [, string options])` - Returns a relation consisting of <string model_id, array<byte> pred_model> + +- `train_xgboost_regr(string[] features, double target [, string options])` - Returns a relation consisting of <string model_id, array<byte> pred_model> + +- `xgboost_multiclass_predict(string rowid, string[] features, string model_id, array<byte> pred_model [, string options])` - Returns a prediction result as (string rowid, string label, float probability) + +- `xgboost_predict(string rowid, string[] features, string model_id, array<byte> pred_model [, string options])` - Returns a prediction result as (string rowid, float 
predicted) + +# Others + +- `hivemall_version()` - Returns the version of Hivemall + +- `lr_datagen(options string)` - Generates a logistic regression dataset + ```sql + WITH dual AS (SELECT 1) SELECT lr_datagen('-n_examples 1k -n_features 10') FROM dual; + ``` + +- `tf(string text)` - Return a term frequency in <string, float> \ No newline at end of file http://git-wip-us.apache.org/repos/asf/incubator-hivemall/blob/2d80faef/docs/gitbook/misc/generic_funcs.md ---------------------------------------------------------------------- diff --git a/docs/gitbook/misc/generic_funcs.md b/docs/gitbook/misc/generic_funcs.md index b6c7c62..3409f26 100644 --- a/docs/gitbook/misc/generic_funcs.md +++ b/docs/gitbook/misc/generic_funcs.md @@ -16,8 +16,8 @@ specific language governing permissions and limitations under the License. --> - -This page describes a list of useful Hivemall generic functions. + +This page describes a list of useful Hivemall generic functions. See also a [list of machine-learning-related functions](./funcs.md). <!-- toc --> @@ -44,7 +44,7 @@ This page describes a list of useful Hivemall generic functions. ```sql select array_remove(array(1,null,3),array(null)); > [3] - + select array_remove(array("aaa","bbb"),"bbb"); > ["aaa"] ``` @@ -57,7 +57,7 @@ This page describes a list of useful Hivemall generic functions. ``` - `subarray_endwith(array<int|text> original, int|text key)` - Returns an array that ends with the specified key - + ```sql select subarray_endwith(array(1,2,3,4), 3); > [1,2,3] @@ -77,6 +77,12 @@ This page describes a list of useful Hivemall generic functions. 
> [3,4] ``` +- `float_array(nDims)` - Returns an array<float> of nDims elements + +- `select_k_best(array<number> array, const array<number> importance, const int k)` - Returns selected top-k elements as array<double> + +- `to_string_array(array<ANY>)` - Returns an array of strings + ## Array UDAFs - `array_avg(array<NUMBER>)` - Returns an array<double> in which each element is the mean of a set of numbers @@ -108,10 +114,10 @@ This page describes a list of useful Hivemall generic functions. to_ordered_list(value, key, '-k -2'), -- [donut, (banana | egg)] (tail-k) to_ordered_list(value, key, '-k -100'), -- [donut, (banana, egg | egg, banana), candy, apple] to_ordered_list(value, key, '-k -2 -reverse'), -- [apple, candy] (reverse tail-k = top-k) - to_ordered_list(value, '-k 2'), -- [egg, donut] (alphabetically) + to_ordered_list(value, '-k 2'), -- [egg, donut] (alphabetically) to_ordered_list(key, '-k -2 -reverse'), -- [5, 4] (top-2 keys) to_ordered_list(key) -- [2, 3, 3, 4, 5] (natural ordered keys) - from + from t ; ``` @@ -201,14 +207,28 @@ The compression level must be in range [-1,9] # MapReduce functions +- `distcache_gets(filepath, key, default_value [, parseKey])` - Returns map<key_type, value_type>|value_type + +- `jobconf_gets()` - Returns the value from JobConf + +- `jobid()` - Returns the value of mapred.job.id + - `rowid()` - Returns a generated row id of a form {TASK_ID}-{SEQUENCE_NUMBER} +- `rownum()` - Returns a generated row number in long + - `taskid()` - Returns the value of mapred.task.partition # Math functions +- `l2_norm(double xi)` - Return L2 norm of a vector which has the given values in each dimension + - `sigmoid(x)` - Returns `1.0 / (1.0 + exp(-x))` +# Matrix functions + +- `transpose_and_dot(array<number> matrix0_row, array<number> matrix1_row)` - Returns dot(matrix0.T, matrix1) as array<array<double>>, shape = (matrix0.#cols, matrix1.#cols) + # Text processing functions - `base91(binary)` - Convert the argument from binary to a BASE91 
string @@ -230,7 +250,7 @@ The compression level must be in range [-1,9] ```sql select normalize_unicode('ï¾ï¾ï½¶ï½¸ï½¶ï¾ ','NFKC'); > ãã³ã«ã¯ã«ã - + select normalize_unicode('ã±ã§ã¦â ¢','NFKC'); > (æ ª)ãã³ãã«III ``` @@ -249,14 +269,16 @@ The compression level must be in range [-1,9] - `tokenize(string englishText [, boolean toLowerCase])` - Returns words in array<string> -- `tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags])` - returns tokenized strings in array<string>. Refer [this article](../misc/tokenizer.html) for detail. +- `tokenize_ja(String line [, const string mode = "normal", const list<string> stopWords, const list<string> stopTags])` - returns tokenized Japanese string in array<string>. Refer [this article](../misc/tokenizer.html) for detail. ```sql select tokenize_ja("kuromojiã使ã£ãåãã¡æ¸ãã®ãã¹ãã§ãã第äºå¼æ°ã«ã¯normal/search/extendedãæå®ã§ãã¾ããããã©ã«ãã§ã¯normalã¢ã¼ãã§ãã"); - + > ["kuromoji","使ã","åãã¡æ¸ã","ãã¹ã","第","äº","å¼æ°","normal","search","extended","æå®","ããã©ã«ã","normal"," ã¢ã¼ã"] ``` +- `tokenize_cn(String line [, const list<string> stopWords])` - returns tokenized Chinese string in array<string>. Refer [this article](../misc/tokenizer.html) for detail. + - `word_ngrams(array<string> words, int minSize, int maxSize)` - Returns list of n-grams where `minSize <= n <= maxSize` ```sql @@ -275,7 +297,7 @@ The compression level must be in range [-1,9] ```sql select generate_series(1,9); - + 1 2 3