[ https://issues.apache.org/jira/browse/SPARK-20838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019979#comment-16019979 ]

Nick Pentreath commented on SPARK-20838:
----------------------------------------

I think this is a duplicate of SPARK-19668

> Spark ML ngram feature extractor should support ngram range like scikit
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20838
>                 URL: https://issues.apache.org/jira/browse/SPARK-20838
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Nick Lothian
>
> Currently the Spark ML ngram extractor requires a single ngram size (which defaults to 2).
> This means that to tokenize to words, bigrams and trigrams (which is pretty common) you need a pipeline like this:
>     tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
>     remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
>     bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
>     trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
>
>     pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
> That's not terrible, but the big problem is that the words, bigrams and trigrams end up in separate fields, and the only way (in pyspark) to combine them is to explode each of the words, bigrams and trigrams fields and then union them together.
> In my experience this makes feature extraction slower than using a Python UDF. This seems preposterous!
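For reference, the scikit-learn behavior the issue asks for (an `ngram_range` parameter that emits all n-gram sizes into one output column) can be sketched in plain Python. This is an illustrative sketch only, not Spark API; the function name and signature below are hypothetical, chosen to mirror scikit-learn's `CountVectorizer(ngram_range=(min_n, max_n))` semantics:

```python
def ngram_range(tokens, min_n=1, max_n=3):
    """Return all n-grams of sizes min_n..max_n (inclusive) as one flat list.

    Mirrors scikit-learn's ngram_range semantics: unigrams, bigrams and
    trigrams land in a single output rather than separate columns, so no
    explode/union step is needed downstream.
    """
    out = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngram_range(["spark", "ml", "ngram"], 1, 2))
# → ['spark', 'ml', 'ngram', 'spark ml', 'ml ngram']
```

A Spark-side equivalent would do this per row inside a single transformer, avoiding the separate `bigrams`/`trigrams` columns from the pipeline above.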



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
