[jira] [Commented] (SPARK-20028) Implement NGrams aggregate function

2017-04-27 Thread Chenzhao Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15986646#comment-15986646
 ] 

Chenzhao Guo commented on SPARK-20028:
--

N-grams are a widely used concept in NLP, but Spark currently doesn't support 
the Hive UDAF GenericUDAFnGrams, so this functionality is missing. 

> Implement NGrams aggregate function
> ---
>
> Key: SPARK-20028
> URL: https://issues.apache.org/jira/browse/SPARK-20028
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Chenzhao Guo
>
> This is an implementation of the `ngrams` aggregate expression, which Hive 
> also implements. It applies the n-gram concept from natural language 
> processing to analyze text.
> Currently, Spark doesn't support the Hive UDAF GenericUDAFnGrams, so this 
> functionality is missing.
> An n-gram is a contiguous subsequence of n items drawn from a given 
> sequence. This expression finds the k most frequent n-grams in one or more 
> sequences. 
> The expression has the form ngrams(children: Array[Array[String]] (or 
> Array[String]), n: Int, k: Int, accuracy: Int). It can be used in conjunction 
> with `sentences`, which splits a String column into arrays. The parameters 
> are: 
> children: the 'given sequence' from which n-grams are collected;
> n: the number of elements in each n-gram; size 1 is referred to as a 
> "unigram", size 2 as a "bigram", size 3 as a "trigram", and so on; 
> k: the number of most frequent n-grams to return (top k);
> accuracy: controls the memory used for frequency estimation; more memory 
> gives more accurate frequency counts.
> A simple example: 
> `SELECT ngrams(array("abc", "abc", "bcd", "abc", "bcd"), 2, 4);` returns
> `[{["abc","bcd"]:2.0}, 
> {["abc","abc"]:1.0}, 
> {["bcd","abc"]:1.0}]`. The input contains four 2-grams: 
> `["abc", "abc"], ["abc", "bcd"], ["bcd", "abc"], ["abc", "bcd"]`. 
> `["abc", "bcd"]` occurs twice and the other two 2-grams occur once each. 
> Only three distinct 2-grams exist, so three results are returned even 
> though k is 4. Ties are ordered alphabetically, so `["abc","abc"]` comes 
> before `["bcd","abc"]`.
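
The counting and ordering described in the example above can be sketched in 
plain Python. This is only an illustration, not the code from the pull 
request: it counts exactly rather than estimating frequencies, so it has no 
`accuracy` parameter, and the name `top_k_ngrams` is hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_k_ngrams(
    sequences: Sequence[Sequence[str]], n: int, k: int
) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent n-grams across all input sequences.

    Exact counting (an assumption made for clarity); the proposed
    expression estimates frequencies under a memory bound instead.
    """
    counts: Counter = Counter()
    for seq in sequences:
        # Slide a window of width n over the sequence; sequences shorter
        # than n contribute nothing because the range is empty.
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    # Highest count first; ties broken alphabetically by the n-gram itself.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


if __name__ == "__main__":
    tokens = ["abc", "abc", "bcd", "abc", "bcd"]
    for ngram, freq in top_k_ngrams([tokens], n=2, k=4):
        print(list(ngram), freq)
```

With the input from the example, this yields `["abc", "bcd"]` with count 2, 
followed by `["abc", "abc"]` and `["bcd", "abc"]` with count 1 each.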



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20028) Implement NGrams aggregate function

2017-03-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932254#comment-15932254
 ] 

Apache Spark commented on SPARK-20028:
--

User 'gczsjdy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17359

> Implement NGrams aggregate function
> ---
>
> Key: SPARK-20028
> URL: https://issues.apache.org/jira/browse/SPARK-20028
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Chenzhao Guo
>
> N-grams are subsequences of length N drawn from a longer sequence. The 
> purpose of ngrams() is to find the k most frequent n-grams in one or 
> more sequences.
> The aggregate function has the form:
> ngrams(array<array<string>> (or array<string>), int N, int K, int accuracy), 
> where
> the first parameter is the 'longer sequence' from which n-grams are collected,
> N is the length of each n-gram and K is the number of most frequent n-grams 
> to return,
> accuracy controls the accuracy of the frequency counts.


