[ https://issues.apache.org/jira/browse/SPARK-20838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019979#comment-16019979 ]
Nick Pentreath commented on SPARK-20838:
----------------------------------------

I think this is a duplicate of SPARK-19668

> Spark ML ngram feature extractor should support ngram range like scikit
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20838
>                 URL: https://issues.apache.org/jira/browse/SPARK-20838
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Nick Lothian
>
> Currently the Spark ML ngram extractor requires a single ngram size (which defaults to 2).
> This means that to tokenize to words, bigrams and trigrams (which is pretty common) you need a pipeline like this:
>
> tokenizer = Tokenizer(inputCol="text", outputCol="tokenized_text")
> remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="words")
> bigram = NGram(n=2, inputCol=remover.getOutputCol(), outputCol="bigrams")
> trigram = NGram(n=3, inputCol=remover.getOutputCol(), outputCol="trigrams")
>
> pipeline = Pipeline(stages=[tokenizer, remover, bigram, trigram])
>
> That's not terrible, but the big problem is that the words, bigrams and trigrams end up in separate fields, and the only way (in PySpark) to combine them is to explode each of the words, bigrams and trigrams fields and then union them together.
> In my experience this makes feature extraction slower than using a Python UDF. This seems preposterous!

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
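
The feature requested above mirrors scikit-learn's `ngram_range=(min_n, max_n)` parameter on `CountVectorizer`, which emits all n-grams for every n in the range into a single output. A minimal pure-Python sketch of that behaviour (the function name `ngrams_range` is illustrative only, not part of any Spark or scikit-learn API):

```python
def ngrams_range(tokens, min_n=1, max_n=3):
    """Return all n-grams for n in [min_n, max_n] as space-joined strings.

    Mimics scikit-learn's ngram_range=(min_n, max_n): unigrams, bigrams,
    trigrams, etc. all land in one flat list, rather than in separate
    columns as with one Spark ML NGram stage per n.
    """
    out = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out
```

For example, `ngrams_range(["spark", "ml", "ngram"], 1, 2)` yields the unigrams followed by the bigrams in one list. A Spark ML `NGram` transformer with such a range parameter could produce this combined list in a single output column, avoiding the explode-and-union step described in the issue.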