[GitHub] incubator-hivemall pull request #110: [HIVEMALL-142] Implement SingularizeUD...

2017-08-27 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/110#discussion_r135446415
  
--- Diff: core/src/main/java/hivemall/utils/lang/StringUtils.java ---
@@ -172,12 +172,17 @@ public static void clear(@Nonnull final StringBuilder buf) {
 
 public static String concat(@Nonnull final List list, @Nonnull final String sep) {
--- End diff --

@myui I guess you originally intended this method to behave like [`org.apache.commons.lang3.StringUtils.join`](https://github.com/apache/commons-lang/blob/1571050a196198f336ae487ee3b6df629d3ee9da/src/main/java/org/apache/commons/lang3/StringUtils.java#L4106-L4150).
However, the original code also appends a separator at the end of the result string:

- expected: `concat(["a", "b", "c"], "-")` => `a-b-c`
- actual: `concat(["a", "b", "c"], "-")` => `a-b-c-`

So, I fixed the method in 796d388c36c520858b6e61deb34100cb9201e5fa. Is this 
okay? If my assumption was incorrect, I will revert the modification and 
introduce an alternative method, `StringUtils.join()`.
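
To make the expected behavior concrete, here is a minimal standalone sketch of a join that places the separator only between elements. The class name `ConcatSketch` is hypothetical and the body is an illustration, not the code in commit 796d388.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for StringUtils.concat(); not the Hivemall source.
final class ConcatSketch {

    // Joins the items with `sep` placed between elements only,
    // so no separator is left dangling after the last element.
    static String concat(List<String> list, String sep) {
        final StringBuilder buf = new StringBuilder();
        boolean first = true;
        for (String s : list) {
            if (!first) {
                buf.append(sep);
            }
            buf.append(s);
            first = false;
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(concat(Arrays.asList("a", "b", "c"), "-")); // a-b-c
    }
}
```

An empty list yields an empty string and a single-element list yields that element unchanged, which is also how `StringUtils.join` from commons-lang behaves.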


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-hivemall pull request #110: [HIVEMALL-142] Implement SingularizeUD...

2017-08-27 Thread takuti
GitHub user takuti opened a pull request:

https://github.com/apache/incubator-hivemall/pull/110

[HIVEMALL-142] Implement SingularizeUDF

## What changes were proposed in this pull request?

Implement `singularize(string word)` to obtain the singular form of `word`.

The implementation refers to the following third-party code:

- https://github.com/sundrio/sundrio/blob/95c2b11f7b842bdaa04f61e8e338aea60fb38f70/codegen/src/main/java/io/sundr/codegen/functions/Singularize.java
- https://github.com/clips/pattern/blob/3eef00481a4555331cf9a099308910d977f6fc22/pattern/text/en/inflect.py#L445-L623

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-142

## How was this patch tested?

unit test & manual test on EMR

## How to use this feature?

as documented

## Checklist

- [x] Did you apply source code formatter, i.e., `mvn formatter:format`, for your commit?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/takuti/incubator-hivemall singularize

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/110.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #110


commit 796d388c36c520858b6e61deb34100cb9201e5fa
Author: Takuya Kitazawa 
Date:   2017-08-28T05:41:43Z

Fix StringUtils.concat() to remove tail unnecessary separator

commit b14ca0975ddc65f0b208ae16734e8f77fb0c126d
Author: Takuya Kitazawa 
Date:   2017-08-28T05:43:51Z

Implement SingularizeUDF






[GitHub] incubator-hivemall issue #107: [HIVEMALL-132] Generalize f1score UDAF to sup...

2017-08-27 Thread nzw0301
Github user nzw0301 commented on the issue:

https://github.com/apache/incubator-hivemall/pull/107
  
@takuti @myui Thank you for your kind comments. 
I have finished the updates based on your reviews.





[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...

2017-08-27 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135444542
  
--- Diff: docs/gitbook/eval/multilabel_classification_measures.md ---
@@ -0,0 +1,144 @@
+
+
+
+
+# Multi-label classification
+
+
+Multi-label classification problem is the task to predict the labels given categorized dataset.
+Each sample $$i$$ has $$l_i$$ labels, where $$L$$ is the number of unique labels in the dataset, and $$0 \leq  l_i \leq |L| $$.
+
+This page focuses on evaluation of the results from such multi-label classification problems.
+
+# Example
+
+For the metrics explanation, this page introduces toy example dataset.
+
+## Data
+
+The following table shows the sample of multi-label classification's prediction.
+Animal names represent the tags of blog post.
+Left column includes supervised labels,
+Right column includes are predicted labels by a Multi-label classifier.
+
+| truth labels| predicted labels |
+|:---:|:---:|
+|cat, dog | cat, bird |
+| cat, bird | cat, dog |
+| | cat |
+| bird | bird |
+| bird, cat | bird, cat |
+| cat, dog, bird | cat, dog |
+| dog | dog, bird|
+
+
+# Evaluation metrics for multi-label classification
+
+Hivemall provides micro F1-score and micro F-measure.
+
+Define $$L$$ is the set of the tag of blog posts, and 
+$$l_i$$ is a tag set of $$i$$th document.
+In the same manner,
+$$p_i$$ is a predicted tag set of $$i$$th document.
+
+## Micro F1-score
+
+F1-score is the harmonic mean of recall and precision.
+
+The value is computed by the following equation:
+
+$$
+\mathrm{F}_1 = 2 \frac
+{\sum_i |l_i \cap p_i |}
+{ 2* \sum_i |l_i \cap p_i | + \sum_i |l_i - p_i | + \sum_i |p_i - l_i | }
+$$
+
+The Following query shows the example to obtain F1-score.
+
+```sql
+WITH data as (
+  select array("cat", "dog") as actual, array("cat", "bird") as predicted
+union all
+  select array("cat", "bird") as actual, array("cat", "dog") as predicted
+union all
+  select array() as actual, array("cat") as predicted
+union all
+  select array("bird") as actual, array("bird") as predicted
+union all
+  select array("bird", "cat") as actual, array("bird", "cat") as predicted
+union all
+  select array("cat", "dog", "bird") as actual, array("cat", "dog") as predicted
+union all
+  select array("dog") as actual, array("dog", "bird") as predicted
+)
+select
+  f1score(actual, predicted)
+from data
+;
+
+--- 0.6956521739130435;
+```
+
+## Micro F-measure
+
+
+F-measure is generalized F1-score and the weighted harmonic mean of recall and precision.
+
+The value is computed by the following equation:
+$$
+\mathrm{F}_{\beta} = (1+\beta^2) \frac
+{\sum_i |l_i \cap p_i |}
+{ \beta^2 (\sum_i |l_i \cap p_i | + \sum_i |p_i - l_i |) + \sum_i |l_i \cap p_i | + \sum_i |l_i - p_i |}
--- End diff --

Thanks! The equation was wrong, because the order in my test code was wrong.
I also changed `FMeasureAggregationBuffer.denom()` to make it easier to understand.
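
For reference, the micro-averaged F-measure can be sketched in plain Java from three counts: true positives, total actual labels (tp + fn) and total predicted labels (tp + fp). The sketch assumes the standard definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), which simplifies to (1 + beta^2) * tp / (beta^2 * totalActual + totalPredicted). The class and method names are hypothetical, and this is not the `FMeasureUDAF` code from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the FMeasureUDAF implementation.
final class MicroFMeasureSketch {

    // Toy dataset from the documentation under review (truth vs. predicted tags).
    static final List<List<String>> ACTUAL = Arrays.asList(
        Arrays.asList("cat", "dog"), Arrays.asList("cat", "bird"),
        Collections.<String>emptyList(), Arrays.asList("bird"),
        Arrays.asList("bird", "cat"), Arrays.asList("cat", "dog", "bird"),
        Arrays.asList("dog"));
    static final List<List<String>> PREDICTED = Arrays.asList(
        Arrays.asList("cat", "bird"), Arrays.asList("cat", "dog"),
        Arrays.asList("cat"), Arrays.asList("bird"),
        Arrays.asList("bird", "cat"), Arrays.asList("cat", "dog"),
        Arrays.asList("dog", "bird"));

    // Micro-averaged F-beta: counts are summed over all samples first.
    static double microFMeasure(List<List<String>> actual,
            List<List<String>> predicted, double beta) {
        long tp = 0L, totalActual = 0L, totalPredicted = 0L;
        for (int i = 0; i < actual.size(); i++) {
            Set<String> truth = new HashSet<>(actual.get(i));
            Set<String> pred = new HashSet<>(predicted.get(i));
            totalActual += truth.size();    // tp + fn
            totalPredicted += pred.size();  // tp + fp
            pred.retainAll(truth);          // intersection = true positives
            tp += pred.size();
        }
        final double b2 = beta * beta;
        final double denom = b2 * totalActual + totalPredicted;
        return denom == 0.d ? 0.d : (1.d + b2) * tp / denom;
    }

    public static void main(String[] args) {
        // beta = 1 gives 2 * 8 / (11 + 12) = 16/23, roughly 0.6957
        System.out.println(microFMeasure(ACTUAL, PREDICTED, 1.d));
    }
}
```

On the toy dataset the counts are tp = 8, totalActual = 11 and totalPredicted = 12, so beta = 1 gives 16/23, which agrees with the `f1score` result quoted in the docs query.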




[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...

2017-08-27 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135444092
  
--- Diff: docs/gitbook/eval/multilabel_classification_measures.md ---
@@ -0,0 +1,144 @@
+
+
+
+
+# Multi-label classification
+
+
+Multi-label classification problem is the task to predict the labels given categorized dataset.
+Each sample $$i$$ has $$l_i$$ labels, where $$L$$ is the number of unique labels in the dataset, and $$0 \leq  l_i \leq |L| $$.
--- End diff --

Yes, I fixed it. Thanks!




[GitHub] incubator-hivemall pull request #107: [HIVEMALL-132] Generalize f1score UDAF...

2017-08-27 Thread nzw0301
Github user nzw0301 commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/107#discussion_r135432462
  
--- Diff: core/src/main/java/hivemall/evaluation/FMeasureUDAF.java ---
@@ -18,118 +18,387 @@
  */
 package hivemall.evaluation;
 
-import hivemall.utils.hadoop.WritableUtils;
+import hivemall.UDAFEvaluatorWithOptions;
+import hivemall.utils.hadoop.HiveUtils;
 
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
 import java.util.List;
 
+import hivemall.utils.lang.Primitives;
+import org.apache.commons.cli.CommandLine;
+import org.apache.commons.cli.Options;
+
 import org.apache.hadoop.hive.ql.exec.Description;
-import org.apache.hadoop.hive.ql.exec.UDAF;
-import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
+import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
+import org.apache.hadoop.hive.ql.metadata.HiveException;
+import org.apache.hadoop.hive.ql.parse.SemanticException;
+import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
 import org.apache.hadoop.hive.serde2.io.DoubleWritable;
-import org.apache.hadoop.io.IntWritable;
-
-@SuppressWarnings("deprecation")
-@Description(name = "f1score",
-value = "_FUNC_(array[int], array[int]) - Return a F-measure/F1 score")
-public final class FMeasureUDAF extends UDAF {
-
-public static class Evaluator implements UDAFEvaluator {
-
-public static class PartialResult {
-long tp;
-/** tp + fn */
-long totalAcutal;
-/** tp + fp */
-long totalPredicted;
-
-PartialResult() {
-this.tp = 0L;
-this.totalPredicted = 0L;
-this.totalAcutal = 0L;
-}
+import org.apache.hadoop.hive.serde2.objectinspector.*;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.BooleanObjectInspector;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+import org.apache.hadoop.hive.serde2.objectinspector.primitive.IntObjectInspector;
+import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
+import org.apache.hadoop.io.LongWritable;
 
-void updateScore(final List actual, final List predicted) {
-final int numActual = actual.size();
-final int numPredicted = predicted.size();
-int countTp = 0;
-for (int i = 0; i < numPredicted; i++) {
-IntWritable p = predicted.get(i);
-if (actual.contains(p)) {
-countTp++;
-}
+import javax.annotation.Nonnull;
+
+@Description(
+name = "fmeasure",
+value = "_FUNC_(array | int | boolean, array | int | boolean, String) - Return a F-measure (f1score is the special with beta=1.)")
+public final class FMeasureUDAF extends AbstractGenericUDAFResolver {
+@Override
+public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) throws SemanticException {
+if (typeInfo.length != 2 && typeInfo.length != 3) {
+throw new UDFArgumentTypeException(typeInfo.length - 1,
+"_FUNC_ takes two or three arguments");
+}
+
+boolean isArg1ListOrIntOrBoolean = HiveUtils.isListTypeInfo(typeInfo[0])
+|| HiveUtils.isIntegerTypeInfo(typeInfo[0])
+|| HiveUtils.isBooleanTypeInfo(typeInfo[0]);
+if (!isArg1ListOrIntOrBoolean) {
+throw new UDFArgumentTypeException(0,
+"The first argument `array/int/boolean actual` is invalid form: " + typeInfo[0]);
+}
+
+boolean isArg2ListOrIntOrBoolean = HiveUtils.isListTypeInfo(typeInfo[1])
+|| HiveUtils.isIntegerTypeInfo(typeInfo[1])
+|| HiveUtils.isBooleanTypeInfo(typeInfo[1]);
+if (!isArg2ListOrIntOrBoolean) {
+throw new UDFArgumentTypeException(1,
+"The first argument `array/int/boolean actual` is invalid form: " + typeInfo[1]);
+}
+
+if (typeInfo[0] != typeInfo[1]) {
+throw new UDFArgumentTypeException(1, "The first argument's `actual` type is "
++ typeInfo[0] + ", but the second argument `predicated`'s type is not match: "
++ typeInfo[1]);
+}
+
+return new Evaluator();
+}
+
+public static class Evaluator extends UDAFEvaluat

[jira] [Created] (HIVEMALL-142) Implement SingularizeUDF for English singular-ization

2017-08-27 Thread Takuya Kitazawa (JIRA)
Takuya Kitazawa created HIVEMALL-142:


 Summary: Implement SingularizeUDF for English singular-ization
 Key: HIVEMALL-142
 URL: https://issues.apache.org/jira/browse/HIVEMALL-142
 Project: Hivemall
  Issue Type: New Feature
Reporter: Takuya Kitazawa
Assignee: Takuya Kitazawa


Something like `singularize('movies')` => `'movie'` could be very useful in 
combination with `tokenize()` for English NLP on Hivemall. 

Implementations mostly rely on regular expressions, for example:

* Java example: 
https://github.com/sundrio/sundrio/blob/master/codegen/src/main/java/io/sundr/codegen/functions/Singularize.java
* One of the most famous Python implementations: 
https://github.com/clips/pattern/blob/master/pattern/text/en/inflect.py#L445-L623
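
As a rough illustration of the regexp-based approach, the sketch below applies a short ordered list of suffix rules after checking a table of irregular words. The rules, the irregular table and the class name `SingularizeSketch` are all made up for this example; they are far smaller than the rule sets in the linked implementations and are not the proposed UDF.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of rule-based singularization; not SingularizeUDF.
final class SingularizeSketch {

    // Words the suffix rules would get wrong are listed explicitly.
    // E.g. the "-ies" rule alone would turn "movies" into "movy".
    private static final Map<String, String> IRREGULAR = new HashMap<>();

    // Ordered rules: the first pattern that matches the whole word wins.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("(.*[^aeiou])ies"),       // cities -> city
        Pattern.compile("(.*(?:ch|sh|ss|x))es"),  // boxes  -> box
        Pattern.compile("(.*[^s])s")              // cats   -> cat
    };
    // Suffix appended to the captured stem for the rule at the same index.
    private static final String[] SUFFIXES = {"y", "", ""};

    static {
        IRREGULAR.put("children", "child");
        IRREGULAR.put("movies", "movie");
    }

    // Returns a lower-cased singular form, or the lower-cased input if no rule applies.
    static String singularize(String word) {
        final String lower = word.toLowerCase(Locale.ROOT);
        final String irregular = IRREGULAR.get(lower);
        if (irregular != null) {
            return irregular;
        }
        for (int i = 0; i < PATTERNS.length; i++) {
            Matcher m = PATTERNS[i].matcher(lower);
            if (m.matches()) {
                return m.group(1) + SUFFIXES[i];
            }
        }
        return lower;
    }

    public static void main(String[] args) {
        System.out.println(singularize("movies")); // movie
        System.out.println(singularize("cities")); // city
    }
}
```

Rule order matters here: the more specific suffixes are tried before the generic trailing "s" rule, and words already ending in "ss" (such as "glass") are left untouched.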



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVEMALL-124) NDCG - BinaryResponseMeasure "fix"

2017-08-27 Thread Takuya Kitazawa (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVEMALL-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143306#comment-16143306
 ] 

Takuya Kitazawa commented on HIVEMALL-124:
--

[~uhyonc] Hi, how is your progress on this issue?

> NDCG - BinaryResponseMeasure "fix"
> --
>
> Key: HIVEMALL-124
> URL: https://issues.apache.org/jira/browse/HIVEMALL-124
> Project: Hivemall
>  Issue Type: Improvement
>Reporter: Uhyon Chung
>Assignee: Takuya Kitazawa
>
> There's a small issue which makes it a bit hard to use the NDCG@x
> from BinaryResponseMeasure.java
> {code:java}
> public static double nDCG(@Nonnull final List rankedList,
> @Nonnull final List groundTruth, @Nonnull final int recommendSize) {
> double dcg = 0.d;
> double idcg = IDCG(Math.min(recommendSize, groundTruth.size()));
> ...
> public static double IDCG(final int n) {
> double idcg = 0.d;
> for (int i = 0; i < n; i++) {
> idcg += Math.log(2) / Math.log(i + 2);
> }
> return idcg;
> }
> {code}
> You'll notice that the way it calculates the idcg for binary NDCG calculation 
> is that it uses the count in groundTruth. The problem is that when we use 
> "recommendSize" (e.g. NDCG@10) we may pass all the ground Truth and not just 
> the ones in the first 10. This is a bit unexpected. Of course, we could just 
> limit the truths using array intersection and what not, but the users 
> shouldn't really have to do that. You can simply just count the # of matched 
> ground truths so it's easier to use this function.
> e.g.
> {code:java}
> public static double nDCG(@Nonnull final List rankedList,
> @Nonnull final List groundTruth, @Nonnull final int recommendSize) {
> double dcg = 0.d;
> int matchedGroundTruths = 0;
> for (int i = 0, n = recommendSize; i < n; i++) {
> Object item_id = rankedList.get(i);
> if (!groundTruth.contains(item_id)) {
> continue;
> }
> int rank = i + 1;
> dcg += Math.log(2) / Math.log(rank + 1);
> matchedGroundTruths++;
> }
> double idcg = IDCG(matchedGroundTruths);
> return dcg / idcg;
> }
> {code}
> Thanks
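
The proposal quoted above can be tried out with the following self-contained sketch. It follows the issue's idea of deriving the ideal DCG from the number of matched ground truths within the top `recommendSize` items. Two guards are additions of this sketch and are not part of the quoted proposal: the loop is capped at the size of the ranked list, and 0 is returned when nothing matches (to avoid dividing by zero). The class name `NdcgSketch` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposal in HIVEMALL-124; not BinaryResponseMeasure.
final class NdcgSketch {

    // Ideal DCG for n relevant items ranked at positions 1..n (binary relevance).
    static double idcg(final int n) {
        double idcg = 0.d;
        for (int i = 0; i < n; i++) {
            idcg += Math.log(2) / Math.log(i + 2);
        }
        return idcg;
    }

    static double ndcg(final List<?> rankedList, final List<?> groundTruth,
            final int recommendSize) {
        double dcg = 0.d;
        int matchedGroundTruths = 0;
        // Guard added in this sketch: never read past the end of the ranked list.
        final int n = Math.min(recommendSize, rankedList.size());
        for (int i = 0; i < n; i++) {
            if (!groundTruth.contains(rankedList.get(i))) {
                continue;
            }
            final int rank = i + 1;
            dcg += Math.log(2) / Math.log(rank + 1);
            matchedGroundTruths++;
        }
        // Guard added in this sketch: no matches means IDCG would be 0.
        if (matchedGroundTruths == 0) {
            return 0.d;
        }
        return dcg / idcg(matchedGroundTruths);
    }

    public static void main(String[] args) {
        List<Integer> ranked = Arrays.asList(1, 2, 3, 4);
        List<Integer> truth = Arrays.asList(1, 3, 5);
        System.out.println(ndcg(ranked, truth, 2)); // only item 1 matches, at rank 1
        System.out.println(ndcg(ranked, truth, 4)); // items 1 and 3 match, at ranks 1 and 3
    }
}
```

With this variant, a ground-truth item that never appears in the top `recommendSize` results does not lower the score, which is the behavior change the issue asks for.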


