Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-85935456
@mengxr Thank you for your help with the Java unit tests. As you may have
guessed, I'm new to both Scala and Java and I was drowning in them.
---
Github user aborsu985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/4504#discussion_r26925182
--- Diff:
mllib/src/test/java/org/apache/spark/ml/feature/JavaTokenizerSuite.java ---
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-83425231
Sorry, my commit was a bit hasty. Are there any automated style checkers you would recommend?
---
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-82969443
Thank you for the tip; I'll look into the Java tests next week when I have
some time. In the meantime, I changed the RegexTokenizer to extend Tokenizer
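A minimal, self-contained sketch of that change (plain Scala, not the actual spark.ml classes, which extend UnaryTransformer; the `matching` and `minTokenLength` names follow the discussion in this thread): RegexTokenizer reuses Tokenizer's interface and only overrides the splitting logic.

```scala
// Plain-Scala sketch of RegexTokenizer extending Tokenizer; the real classes
// live in org.apache.spark.ml.feature.
class Tokenizer {
  // Default behavior: lowercase the text and split on whitespace.
  def createTransformFunc: String => Seq[String] =
    _.toLowerCase.split("\\s+").toSeq
}

class RegexTokenizer(
    pattern: String = "\\p{L}+|[^\\p{L}\\s]+", // words, or runs of punctuation
    matching: Boolean = true,                  // true: the regex matches tokens
    minTokenLength: Int = 1) extends Tokenizer {

  override def createTransformFunc: String => Seq[String] = { text =>
    val re = pattern.r
    val tokens =
      if (matching) re.findAllIn(text).toSeq // keep what the regex matches
      else re.split(text).toSeq              // split on what the regex matches
    tokens.filter(_.length >= minTokenLength)
  }
}

// new RegexTokenizer().createTransformFunc("Don't panic!")
// => Seq("Don", "'", "t", "panic", "!")
```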
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-82410900
@mengxr I do not think that LowerCase warrants its own transformer; rather, it
could be incorporated into a larger string-to-vector transformer that changes a
text
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-76861500
Changed the minimum token length to 1 and removed the excluded bit.
Added a matching param which allows switching between a matching regex and a
splitting regex (see the sketch below).
Reduced
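For reference, in the RegexTokenizer API as it eventually shipped in spark.ml, this switch surfaced as the `gaps` param rather than `matching`. A usage sketch (the column names `text` and `tokens` are assumptions):

```scala
import org.apache.spark.ml.feature.RegexTokenizer

// gaps = false: the pattern describes the tokens themselves (matching regex).
val matchingTokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\w+")
  .setGaps(false)
  .setMinTokenLength(1) // the minimum token length discussed above

// gaps = true: the pattern describes the separators (splitting regex).
val splittingTokenizer = new RegexTokenizer()
  .setInputCol("text")
  .setOutputCol("tokens")
  .setPattern("\\s+")
  .setGaps(true)
```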
Github user aborsu985 commented on a diff in the pull request:
https://github.com/apache/spark/pull/4504#discussion_r25652664
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
---
@@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String, Seq
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-73856773
@mengxr
I changed the title to be more specific about the change and enabled the
regex to be configurable (as well as the stopwords). There is an issue
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-73681076
This is not meant to be a standalone tokenizer but rather part of a
pipeline. To that end, it has parameters that can be varied in order to decide
which
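A sketch of that pipeline use, assuming the spark.ml Pipeline and tuning APIs and hypothetical column names; the tokenizer's params are varied through a param grid rather than fixed:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, RegexTokenizer}
import org.apache.spark.ml.tuning.ParamGridBuilder

val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val hashingTF = new HashingTF().setInputCol("tokens").setOutputCol("features")
val lr = new LogisticRegression()

// The tokenizer is one stage among others, not a standalone tool.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Its params can be varied during model selection, e.g. via a param grid
// handed to CrossValidator, to decide which tokenization works best.
val grid = new ParamGridBuilder()
  .addGrid(tokenizer.pattern, Array("\\s+", "\\W+"))
  .addGrid(tokenizer.minTokenLength, Array(1, 2))
  .build()
```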
GitHub user aborsu985 opened a pull request:
https://github.com/apache/spark/pull/4504
RegEx Tokenizer for mllib
Added a regex-based tokenizer for mllib.
Currently the regex is fixed, but if I could add a regex-type parameter to
the paramMap,
changing the tokenizer regex
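A framework-free sketch of what a regex parameter in the paramMap could mean (the `ToyParamMap` here is a stand-in for illustration, not the spark.ml ParamMap class): the pattern is looked up from a parameter map instead of being hard-coded.

```scala
// Illustration only: a toy param map, not org.apache.spark.ml.param.ParamMap.
final case class ToyParamMap(params: Map[String, Any] = Map.empty) {
  def getOrElse[T](name: String, default: T): T =
    params.get(name).map(_.asInstanceOf[T]).getOrElse(default)
}

// The tokenizer regex becomes configurable: callers override "pattern".
def tokenize(text: String, paramMap: ToyParamMap): Seq[String] = {
  val pattern = paramMap.getOrElse("pattern", "\\s+") // default: whitespace
  text.split(pattern).toSeq.filter(_.nonEmpty)
}

// tokenize("a,b c", ToyParamMap(Map("pattern" -> "[,\\s]+")))
// => Seq("a", "b", "c")
```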
Github user aborsu985 commented on the pull request:
https://github.com/apache/spark/pull/4504#issuecomment-73693526
Do you mean restricting tokens to a predefined set of words?
---