[ https://issues.apache.org/jira/browse/SPARK-5566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14313996#comment-14313996 ]
Augustin Borsu edited comment on SPARK-5566 at 2/11/15 9:58 AM:
----------------------------------------------------------------

https://github.com/apache/spark/pull/4504

I propose a tokenizer loosely based on the NLTK RegexpTokenizer. I didn't create a standalone tokenizer in mllib and wrap it in ml, as I don't think a standalone tokenizer is necessarily needed in mllib, but if people disagree I can change that.

was (Author: augustinb):
We could use a tokenizer like this, but we would need to add regex and Array[String] parameter types to be able to change those parameters in a cross-validation.
https://github.com/apache/spark/pull/4504


> Tokenizer for mllib package
> ---------------------------
>
>                 Key: SPARK-5566
>                 URL: https://issues.apache.org/jira/browse/SPARK-5566
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> There exist tokenizer classes in the spark.ml.feature package and in the
> LDAExample in the spark.examples.mllib package. The Tokenizer in the
> LDAExample is more advanced and should be made into a full-fledged public
> class in spark.mllib.feature. The spark.ml.feature.Tokenizer class should
> become a wrapper around the new Tokenizer.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
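[Editorial sketch] The comment proposes a tokenizer loosely modeled on NLTK's RegexpTokenizer, with the regex pattern and a stop-word list exposed as tunable parameters so they can be varied during cross-validation. The actual code lives in PR #4504; the following is only a minimal illustration of that idea, and the names `RegexTokenizer`, `pattern`, `gaps`, and `stop_words` are assumptions for this sketch, not the PR's API.

```python
import re

class RegexTokenizer:
    """Minimal regex-based tokenizer in the spirit of NLTK's RegexpTokenizer.

    Parameter names here are illustrative, not the PR's actual API:
      pattern    - regex used to extract tokens (or split on, if gaps=True)
      gaps       - if True, the pattern matches separators instead of tokens
      stop_words - tokens to drop after lowercasing
    Exposing pattern/stop_words as parameters is what would let a
    cross-validator search over them.
    """

    def __init__(self, pattern=r"\w+", gaps=False, stop_words=None):
        self.pattern = re.compile(pattern)
        self.gaps = gaps
        self.stop_words = set(stop_words or [])

    def transform(self, text):
        # Either split on the pattern (gaps) or collect its matches (tokens).
        tokens = self.pattern.split(text) if self.gaps else self.pattern.findall(text)
        return [t.lower() for t in tokens
                if t and t.lower() not in self.stop_words]

tok = RegexTokenizer(pattern=r"[A-Za-z]+", stop_words=["the", "a"])
print(tok.transform("The quick brown fox"))  # -> ['quick', 'brown', 'fox']
```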