[GitHub] incubator-hivemall pull request #110: [HIVEMALL-142] Implement SingularizeUD...
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/110#discussion_r135446415

--- Diff: core/src/main/java/hivemall/utils/lang/StringUtils.java ---
@@ -172,12 +172,17 @@ public static void clear(@Nonnull final StringBuilder buf) {
 public static String concat(@Nonnull final List list, @Nonnull final String sep) {
--- End diff --

@myui I guess you originally assumed this method behaves in a similar way to [what `org.apache.commons.lang3.StringUtils.join` does](https://github.com/apache/commons-lang/blob/1571050a196198f336ae487ee3b6df629d3ee9da/src/main/java/org/apache/commons/lang3/StringUtils.java#L4106-L4150). However, the original code appends a separator even at the end of the result string:

- expected: `concat(["a", "b", "c"], "-")` => `a-b-c`
- actual: `concat(["a", "b", "c"], "-")` => `a-b-c-`

So, I fixed the method in 796d388c36c520858b6e61deb34100cb9201e5fa. Is this okay? If my assumption was incorrect, I will revert the modification and introduce an alternative method, `StringUtils.join()`.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
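The expected behavior described in the comment above (a separator between elements, but none after the last one) can be sketched as follows. This is only an illustration: `ConcatSketch` is a hypothetical class, the element type `String` is an assumption, and the body is not the code from commit 796d388.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the fixed method; NOT the code from commit 796d388. */
final class ConcatSketch {

    private ConcatSketch() {}

    /** Joins the elements with sep between them, never after the last one. */
    static String concat(final List<String> list, final String sep) {
        final StringBuilder buf = new StringBuilder();
        boolean first = true;
        for (String s : list) {
            if (!first) {
                buf.append(sep); // separator goes before every element except the first
            }
            buf.append(s);
            first = false;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat(Arrays.asList("a", "b", "c"), "-")); // a-b-c
    }
}
```

Placing the separator before each element except the first avoids having to trim the buffer afterwards, and an empty list naturally yields an empty string.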
[GitHub] incubator-hivemall pull request #110: [HIVEMALL-142] Implement SingularizeUD...
GitHub user takuti opened a pull request:

https://github.com/apache/incubator-hivemall/pull/110

[HIVEMALL-142] Implement SingularizeUDF

## What changes were proposed in this pull request?

Implement `singularize(string word)` to obtain the singular form of `word`.

The implementation refers to the following third-party code:

- https://github.com/sundrio/sundrio/blob/95c2b11f7b842bdaa04f61e8e338aea60fb38f70/codegen/src/main/java/io/sundr/codegen/functions/Singularize.java
- https://github.com/clips/pattern/blob/3eef00481a4555331cf9a099308910d977f6fc22/pattern/text/en/inflect.py#L445-L623

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-142

## How was this patch tested?

unit test & manual test on EMR

## How to use this feature?

as documented

## Checklist

- [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/takuti/incubator-hivemall singularize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/110.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #110

commit 796d388c36c520858b6e61deb34100cb9201e5fa
Author: Takuya Kitazawa
Date: 2017-08-28T05:41:43Z

Fix StringUtils.concat() to remove tail unnecessary separator

commit b14ca0975ddc65f0b208ae16734e8f77fb0c126d
Author: Takuya Kitazawa
Date: 2017-08-28T05:43:51Z

Implement SingularizeUDF
[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...
Github user nzw0301 commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107

@takuti @myui Thank you for your kind comments. I have completed the updates based on your reviews.
[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135444542

--- Diff: docs/gitbook/eval/multilabel_classification_measures.md ---
@@ -0,0 +1,144 @@
+
+
+
+
+# Multi-label classification
+
+
+Multi-label classification problem is the task to predict the labels given categorized dataset.
+Each sample $$i$$ has $$l_i$$ labels, where $$L$$ is the number of unique labels in the dataset, and $$0 \leq l_i \leq |L| $$.
+
+This page focuses on evaluation of the results from such multi-label classification problems.
+
+# Example
+
+For the metrics explanation, this page introduces toy example dataset.
+
+## Data
+
+The following table shows the sample of multi-label classification's prediction.
+Animal names represent the tags of blog post.
+Left column includes supervised labels,
+Right column includes are predicted labels by a Multi-label classifier.
+
+| truth labels| predicted labels |
+|:---:|:---:|
+|cat, dog | cat, bird |
+| cat, bird | cat, dog |
+| | cat |
+| bird | bird |
+| bird, cat | bird, cat |
+| cat, dog, bird | cat, dog |
+| dog | dog, bird|
+
+
+# Evaluation metrics for multi-label classification
+
+Hivemall provides micro F1-score and micro F-measure.
+
+Define $$L$$ is the set of the tag of blog posts, and
+$$l_i$$ is a tag set of $$i$$th document.
+In the same manner,
+$$p_i$$ is a predicted tag set of $$i$$th document.
+
+## Micro F1-score
+
+F1-score is the harmonic mean of recall and precision.
+
+The value is computed by the following equation:
+
+$$
+\mathrm{F}_1 = 2 \frac
+{\sum_i |l_i \cap p_i |}
+{ 2* \sum_i |l_i \cap p_i | + \sum_i |l_i - p_i | + \sum_i |p_i - l_i | }
+$$
+
+The Following query shows the example to obtain F1-score.
+
+```sql
+WITH data as (
+  select array("cat", "dog") as actual, array("cat", "bird") as predicted
+union all
+  select array("cat", "bird")as actual, array("cat", "dog") as predicted
+union all
+  select array() as actual, array("cat") as predicted
+union all
+  select array("bird") as actual, array("bird")as predicted
+union all
+  select array("bird", "cat")as actual, array("bird", "cat") as predicted
+union all
+  select array("cat", "dog", "bird") as actual, array("cat", "dog") as predicted
+union all
+  select array("dog")as actual, array("dog", "bird") as predicted
+)
+select
+  f1score(actual, predicted)
+from data
+;
+
+--- 0.6956521739130435;
+```
+
+## Micro F-measure
+
+
+F-measure is generalized F1-score and the weighted harmonic mean of recall and precision.
+
+The value is computed by the following equation:
+$$
+\mathrm{F}_{\beta} = (1+\beta^2) \frac
+{\sum_i |l_i \cap p_i |}
+{ \beta^2 (\sum_i |l_i \cap p_i | + \sum_i |p_i - l_i |) + \sum_i |l_i \cap p_i | + \sum_i |l_i - p_i |}
--- End diff --

Thanks! That equation is wrong (because my test code used the wrong order). I also changed `FMeasureAggregationBuffer.denom()` to make it easier to understand.
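For reference, the micro-averaged scores discussed in this review can be checked by hand on the toy table from the quoted diff. The sketch below assumes the standard definition of the F-measure, F_beta = (1 + beta^2) * tp / ((1 + beta^2) * tp + beta^2 * fn + fp), where tp, fn and fp are the summed sizes of `l_i ∩ p_i`, `l_i - p_i` and `p_i - l_i`. `MicroFMeasureSketch` and its members are hypothetical names; this is not the `FMeasureUDAF` or `FMeasureAggregationBuffer` code from the pull request.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for checking the toy example by hand; NOT FMeasureUDAF. */
final class MicroFMeasureSketch {

    private MicroFMeasureSketch() {}

    // Rows of the toy table in the quoted diff (truth labels / predicted labels).
    static final List<List<String>> ACTUAL = Arrays.asList(
        Arrays.asList("cat", "dog"), Arrays.asList("cat", "bird"),
        Collections.<String>emptyList(), Arrays.asList("bird"),
        Arrays.asList("bird", "cat"), Arrays.asList("cat", "dog", "bird"),
        Arrays.asList("dog"));
    static final List<List<String>> PREDICTED = Arrays.asList(
        Arrays.asList("cat", "bird"), Arrays.asList("cat", "dog"),
        Arrays.asList("cat"), Arrays.asList("bird"),
        Arrays.asList("bird", "cat"), Arrays.asList("cat", "dog"),
        Arrays.asList("dog", "bird"));

    /** Micro-averaged F-beta under the standard definition (an assumption, see lead-in). */
    static double microFMeasure(List<List<String>> actual, List<List<String>> predicted,
            double beta) {
        long tp = 0L, fn = 0L, fp = 0L;
        for (int i = 0; i < actual.size(); i++) {
            Set<String> l = new HashSet<>(actual.get(i));
            Set<String> p = new HashSet<>(predicted.get(i));
            Set<String> both = new HashSet<>(l);
            both.retainAll(p);              // l_i ∩ p_i
            tp += both.size();
            fn += l.size() - both.size();   // |l_i - p_i|
            fp += p.size() - both.size();   // |p_i - l_i|
        }
        final double b2 = beta * beta;
        final double denom = (1.d + b2) * tp + b2 * fn + fp;
        return denom == 0.d ? 0.d : (1.d + b2) * tp / denom;
    }

    public static void main(String[] args) {
        // tp = 8, fn = 3, fp = 4 on the toy data, so F1 = 16/23 (about 0.6957).
        System.out.println(microFMeasure(ACTUAL, PREDICTED, 1.d));
    }
}
```

With beta = 1 this reduces to the F1 equation in the diff, and the toy data gives 16/23, which agrees with the `0.6956521739130435` shown in the SQL example.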
[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135444092

--- Diff: docs/gitbook/eval/multilabel_classification_measures.md ---
@@ -0,0 +1,144 @@
+
+
+
+
+# Multi-label classification
+
+
+Multi-label classification problem is the task to predict the labels given categorized dataset.
+Each sample $$i$$ has $$l_i$$ labels, where $$L$$ is the number of unique labels in the dataset, and $$0 \leq l_i \leq |L| $$.
--- End diff --

Yes, I fixed it. Thanks!
[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135432462

--- Diff: core/src/main/java/hivemall/evaluation/FMeasureUDAF.java ---
@@ -18,118 +18,387 @@
  */
 package hivemall.evaluation;
-import hivemall.utils.hadoop.WritableUtils;
+import hivemall.UDAFEvaluatorWithOptions;
+import hivemall.utils.hadoop.HiveUtils;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
 import java.util.List;
+import hivemall.utils.lang.Primitives;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+
 import org.apache.hadoop.hive.ql.exec.Description;
-import org.apache.hadoop.hive.ql.exec.UDAF;
-import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
 import org.apache.hadoop.hive.serde2.io.DoubleWritable;
-import org.apache.hadoop.io.IntWritable;
-
-@SuppressWarnings("deprecation")
-@Description(name = "f1score",
-        value = "_FUNC_(array[int], array[int]) - Return a F-measure/F1 score")
-public final class FMeasureUDAF extends UDAF {
-
-    public static class Evaluator implements UDAFEvaluator {
-
-        public static class PartialResult {
-            long tp;
-            /** tp + fn */
-            long totalAcutal;
-            /** tp + fp */
-            long totalPredicted;
-
-            PartialResult() {
-                this.tp = 0L;
-                this.totalPredicted = 0L;
-                this.totalAcutal = 0L;
-            }
+import org.apache.hadoop.hive.serde2.objectinspector.*;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.io.LongWritable;
-            void updateScore(final List actual, final List predicted) {
-                final int numActual = actual.size();
-                final int numPredicted = predicted.size();
-                int countTp = 0;
-                for (int i = 0; i < numPredicted; i++) {
-                    IntWritable p = predicted.get(i);
-                    if (actual.contains(p)) {
-                        countTp++;
-                    }
+import javax.annotation.Nonnull;
+
+@Description(
+        name = "fmeasure",
+        value = "_FUNC_(array | int | boolean, array | int | boolean, String) - Return a F-measure (f1score is the special with beta=1.)")
+public final class FMeasureUDAF extends AbstractGenericUDAFResolver {
+    @Override
+    public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) throws SemanticException {
+        if (typeInfo.length != 2 && typeInfo.length != 3) {
+            throw new UDFArgumentTypeException(typeInfo.length - 1,
+                "_FUNC_ takes two or three arguments");
+        }
+
+        boolean isArg1ListOrIntOrBoolean = HiveUtils.isListTypeInfo(typeInfo[0])
+                || HiveUtils.isIntegerTypeInfo(typeInfo[0])
+                || HiveUtils.isBooleanTypeInfo(typeInfo[0]);
+        if (!isArg1ListOrIntOrBoolean) {
+            throw new UDFArgumentTypeException(0,
+                "The first argument `array/int/boolean actual` is invalid form: " + typeInfo[0]);
+        }
+
+        boolean isArg2ListOrIntOrBoolean = HiveUtils.isListTypeInfo(typeInfo[1])
+                || HiveUtils.isIntegerTypeInfo(typeInfo[1])
+                || HiveUtils.isBooleanTypeInfo(typeInfo[1]);
+        if (!isArg2ListOrIntOrBoolean) {
+            throw new UDFArgumentTypeException(1,
+                "The first argument `array/int/boolean actual` is invalid form: " + typeInfo[1]);
+        }
+
+        if (typeInfo[0] != typeInfo[1]) {
+            throw new UDFArgumentTypeException(1, "The first argument's `actual` type is "
+                    + typeInfo[0] + ", but the second argument `predicated`'s type is not match: "
+                    + typeInfo[1]);
+        }
+
+        return new Evaluator();
+    }
+
+    public static class Evaluator extends UDAFEvaluat
[jira] [Created] (HIVEMALL-142) Implement SingularizeUDF for English singular-ization
Takuya Kitazawa created HIVEMALL-142:

Summary: Implement SingularizeUDF for English singular-ization
Key: HIVEMALL-142
URL: https://issues.apache.org/jira/browse/HIVEMALL-142
Project: Hivemall
Issue Type: New Feature
Reporter: Takuya Kitazawa
Assignee: Takuya Kitazawa

Something like `singularize('movies')` => `'movie'` could be very useful in combination with `tokenize()` for English NLP on Hivemall.

The implementation mostly relies on regular expressions, as in:

* Java example: https://github.com/sundrio/sundrio/blob/master/codegen/src/main/java/io/sundr/codegen/functions/Singularize.java
* One of the most famous Python implementations: https://github.com/clips/pattern/blob/master/pattern/text/en/inflect.py#L445-L623

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
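The regexp-based approach mentioned in the issue can be illustrated with a deliberately small sketch: an ordered list of suffix rules, where the first matching rule rewrites the word. The class `SingularizeSketch` and its four rules are hypothetical and chosen only for illustration; they are not taken from the sundrio or pattern sources linked above, and they do not cover irregular nouns or many common exceptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical rule-based singularizer; a tiny illustrative rule set only. */
final class SingularizeSketch {

    // Insertion order matters: more specific rules must come first.
    private static final Map<Pattern, String> RULES = new LinkedHashMap<>();

    static {
        rule("(m)ovies$", "$1ovie");        // movies -> movie (special case)
        rule("([^aeiouy])ies$", "$1y");     // cities -> city
        rule("(x|ch|ss|sh)es$", "$1");      // boxes -> box
        rule("([^s])s$", "$1");             // cats -> cat, but "class" is left alone
    }

    private SingularizeSketch() {}

    private static void rule(String regex, String replacement) {
        RULES.put(Pattern.compile(regex, Pattern.CASE_INSENSITIVE), replacement);
    }

    /** Applies the first matching rule; returns the word unchanged if none matches. */
    static String singularize(String word) {
        for (Map.Entry<Pattern, String> e : RULES.entrySet()) {
            Matcher m = e.getKey().matcher(word);
            if (m.find()) {
                return m.replaceFirst(e.getValue());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(singularize("movies")); // movie
    }
}
```

The special case for "movies" is listed first because the general "-ies" rule would otherwise turn it into "movy"; real implementations carry many such ordered exceptions.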
[jira] [Commented] (HIVEMALL-124) NDCG - BinaryResponseMeasure "fix"
[ https://issues.apache.org/jira/browse/HIVEMALL-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143306#comment-16143306 ]

Takuya Kitazawa commented on HIVEMALL-124:
------------------------------------------

[~uhyonc] Hi, how is your progress on this issue?

> NDCG - BinaryResponseMeasure "fix"
> ----------------------------------
>
> Key: HIVEMALL-124
> URL: https://issues.apache.org/jira/browse/HIVEMALL-124
> Project: Hivemall
> Issue Type: Improvement
> Reporter: Uhyon Chung
> Assignee: Takuya Kitazawa
>
> There's a small issue which makes it a bit hard to use the NDCG@x from BinaryResponseMeasure.java
> {code:java}
> public static double nDCG(@Nonnull final List rankedList,
>         @Nonnull final List groundTruth, @Nonnull final int recommendSize) {
>     double dcg = 0.d;
>     double idcg = IDCG(Math.min(recommendSize, groundTruth.size()));
>     ...
> public static double IDCG(final int n) {
>     double idcg = 0.d;
>     for (int i = 0; i < n; i++) {
>         idcg += Math.log(2) / Math.log(i + 2);
>     }
>     return idcg;
> }
> {code}
> You'll notice that the way it calculates the idcg for binary NDCG calculation is that it uses the count in groundTruth. The problem is that when we use "recommendSize" (e.g. NDCG@10) we may pass all the ground Truth and not just the ones in the first 10. This is a bit unexpected. Of course, we could just limit the truths using array intersection and what not, but the users shouldn't really have to do that. You can simply just count the # of matched ground truths so it's easier to use this function.
> e.g.
> {code:java}
> public static double nDCG(@Nonnull final List rankedList,
>         @Nonnull final List groundTruth, @Nonnull final int recommendSize) {
>     double dcg = 0.d;
>     int matchedGroundTruths = 0;
>     for (int i = 0, n = recommendSize; i < n; i++) {
>         Object item_id = rankedList.get(i);
>         if (!groundTruth.contains(item_id)) {
>             continue;
>         }
>         int rank = i + 1;
>         dcg += Math.log(2) / Math.log(rank + 1);
>         matchedGroundTruths++;
>     }
>     double idcg = IDCG(matchedGroundTruths);
>     return dcg / idcg;
> }
> {code}
> Thanks
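The change proposed in the quoted issue (normalizing by the IDCG of the number of matched ground truths within the top `recommendSize` items) can be tried out in isolation with the sketch below. `NdcgSketch` is a hypothetical standalone class, not `BinaryResponseMeasure`. Two guards are added here as assumptions that are not part of the issue text: the loop is capped at the size of the ranked list, and the method returns 0 when nothing matches, so that it never divides by zero.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical standalone version of the proposal; NOT BinaryResponseMeasure. */
final class NdcgSketch {

    private NdcgSketch() {}

    /** Ideal DCG for n relevant items ranked first, as in the quoted IDCG(). */
    static double idcg(final int n) {
        double idcg = 0.d;
        for (int i = 0; i < n; i++) {
            idcg += Math.log(2) / Math.log(i + 2);
        }
        return idcg;
    }

    /** Binary nDCG@recommendSize, normalized by the matched ground truths only. */
    static double ndcg(final List<?> rankedList, final List<?> groundTruth,
            final int recommendSize) {
        double dcg = 0.d;
        int matchedGroundTruths = 0;
        // Guard (assumption): never read past the end of the ranked list.
        final int n = Math.min(recommendSize, rankedList.size());
        for (int i = 0; i < n; i++) {
            if (!groundTruth.contains(rankedList.get(i))) {
                continue;
            }
            final int rank = i + 1;
            dcg += Math.log(2) / Math.log(rank + 1);
            matchedGroundTruths++;
        }
        // Guard (assumption): no match means a score of 0 rather than 0/0.
        if (matchedGroundTruths == 0) {
            return 0.d;
        }
        return dcg / idcg(matchedGroundTruths);
    }

    public static void main(String[] args) {
        List<Integer> ranked = Arrays.asList(1, 3, 2, 6);
        List<Integer> truth = Arrays.asList(1, 2, 4);
        System.out.println(ndcg(ranked, truth, 2));
        System.out.println(ndcg(ranked, truth, 4));
    }
}
```

With this normalization a list whose only hit within the cutoff sits at rank 1 scores 1.0 regardless of how many ground-truth items exist, which is the behavioral difference from normalizing by `Math.min(recommendSize, groundTruth.size())`.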