[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85935456 @mengxr Thank you for your help with the Java unit tests. As you may have guessed, I'm new to both Scala and Java and I was drowning in it.

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-23 Thread aborsu985
Github user aborsu985 commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26925182 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaTokenizerSuite.java --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software ...
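The quoted Java suite is truncated above. As a rough Scala sketch of the kind of round-trip check such a tokenizer test performs, written against the RegexTokenizer API as it later landed in spark.ml (the inputs, column names, and expected tokens here are illustrative, not quoted from the diff):

    import org.apache.spark.ml.feature.RegexTokenizer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val df = Seq("Test of tok.", "Te,st. punct").toDF("rawText")

    val tokenizer = new RegexTokenizer()
      .setInputCol("rawText")
      .setOutputCol("tokens")
      .setPattern("\\s")        // gaps mode: split on whitespace
      .setMinTokenLength(3)     // drop tokens shorter than 3 characters
      .setToLowercase(false)    // later versions lowercase by default

    val tokens = tokenizer.transform(df)
      .select("tokens").as[Seq[String]].collect()

    assert(tokens(0) == Seq("Test", "tok."))    // "of" is filtered by length
    assert(tokens(1) == Seq("Te,st.", "punct"))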

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-19 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83425231 Sorry, my commit was a bit hasty. Are there any automated style checkers you would recommend?

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82969443 Thank you for the tip, I'll look into the Java tests next week when I have some time. But in the meantime, I changed the RegexTokenizer to extend from Tokenizer ...

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82410900 @mengxr I do not think that LowerCase warrants its own transformer; rather, it could be incorporated into a larger string-to-vector transformer that changes a text ...
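A minimal sketch of what folding lowercasing into the tokenization step looks like, rather than giving it a standalone transformer (plain Scala with illustrative names; later versions of the merged RegexTokenizer expose this choice as a toLowercase parameter):

    // Lowercasing handled inside the tokenizer rather than as a separate stage.
    def tokenize(text: String, pattern: String, toLowercase: Boolean): Seq[String] = {
      val normalized = if (toLowercase) text.toLowerCase else text
      pattern.r.split(normalized).filter(_.nonEmpty).toSeq
    }

    tokenize("Some INPUT text", "\\s+", toLowercase = true)
    // => Seq("some", "input", "text")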

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-02 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-76861500 Changed the minimum token length to 1 and removed the excluded bit. Added a matching param which allows switching between a matching regex and a splitting regex. Reduced ...
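The matching param described here corresponds to what shipped as the boolean gaps parameter in the RegexTokenizer API: gaps = true splits on the regex, gaps = false keeps what the regex matches. A sketch of the two modes (column names are illustrative):

    import org.apache.spark.ml.feature.RegexTokenizer

    // Splitting mode: the regex matches the gaps between tokens.
    val splitting = new RegexTokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .setGaps(true)
      .setPattern("\\s+")   // split on runs of whitespace

    // Matching mode: the regex matches the tokens themselves.
    val matching = new RegexTokenizer()
      .setInputCol("sentence").setOutputCol("words")
      .setGaps(false)
      .setPattern("\\w+")   // keep runs of word characters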

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-02 Thread aborsu985
Github user aborsu985 commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25652664 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String, Seq ...
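The hunk is truncated; for orientation, here is a condensed sketch of the RegexTokenizer class this diff adds alongside Tokenizer, modeled on the shape of the merged code (uid handling, validation, setters, and the minimum-token-length param are simplified or omitted):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap}
    import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

    class RegexTokenizer extends UnaryTransformer[String, Seq[String], RegexTokenizer] {

      override val uid: String = "regexTok"  // merged code uses Identifiable.randomUID

      val pattern: Param[String] =
        new Param(this, "pattern", "regex used for tokenizing")
      val gaps: BooleanParam =
        new BooleanParam(this, "gaps", "whether the regex matches gaps or tokens")
      setDefault(pattern -> "\\s+", gaps -> true)

      override protected def createTransformFunc: String => Seq[String] = { str =>
        val re = $(pattern).r
        if ($(gaps)) re.split(str).toSeq else re.findAllIn(str).toSeq
      }

      override protected def outputDataType: DataType = ArrayType(StringType)

      override def copy(extra: ParamMap): RegexTokenizer = defaultCopy(extra)
    }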

[GitHub] spark pull request: [ML][FEATURE] RegEx Tokenizer

2015-02-11 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-73856773 @mengxr I changed the title to be more specific about the change and made the regex configurable (as well as the stopwords). There is an issue ...
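Making the regex configurable amounts to declaring a Param plus a getter/setter pair. A minimal sketch using the spark.ml Params API as it later stabilized (the helper trait and its name are hypothetical; the merged code declares the param directly on the tokenizer):

    import org.apache.spark.ml.param.{Param, Params}

    // Hypothetical helper trait, for illustration only.
    trait HasPattern extends Params {
      val pattern: Param[String] =
        new Param(this, "pattern", "regex pattern used for tokenizing")

      def getPattern: String = $(pattern)
      def setPattern(value: String): this.type = set(pattern, value)
    }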

[GitHub] spark pull request: RegEx Tokenizer for mllib

2015-02-10 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-73681076 This is not meant to be a standalone tokenizer but rather part of a pipeline. To that end, it has parameters that can be varied in order to decide which ...
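In spark.ml, that pipeline framing lets the tokenizer's regex be varied like any other hyperparameter. A sketch with illustrative downstream stages (the data columns and grid values are assumptions):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
    import org.apache.spark.ml.tuning.ParamGridBuilder

    val tokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()

    val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))

    // The tokenizer's regex is tuned alongside the model's hyperparameters.
    val grid = new ParamGridBuilder()
      .addGrid(tokenizer.pattern, Array("\\s+", "\\W+"))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()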

[GitHub] spark pull request: RegEx Tokenizer for mllib

2015-02-10 Thread aborsu985
GitHub user aborsu985 opened a pull request: https://github.com/apache/spark/pull/4504 RegEx Tokenizer for mllib Added a regex-based tokenizer for mllib. Currently the regex is fixed, but if I could add a regex-type parameter to the paramMap, changing the tokenizer regex ...
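The configurable-regex idea proposed here is what eventually shipped. A minimal end-to-end usage sketch against the merged spark.ml API (data and column names are illustrative):

    import org.apache.spark.ml.feature.RegexTokenizer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((0, "RegEx Tokenizer for mllib")).toDF("id", "sentence")

    // The regex is an ordinary param rather than a fixed constant.
    val tokenizer = new RegexTokenizer()
      .setInputCol("sentence")
      .setOutputCol("words")
      .setPattern("\\W+")   // split on non-word characters

    tokenizer.transform(df).select("words").show(truncate = false)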

[GitHub] spark pull request: RegEx Tokenizer for mllib

2015-02-10 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-73693526 Do you mean restricting tokens to a predefined set of words?