[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85893445 [Test build #29154 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29154/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85890886 [Test build #29153 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29153/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85931868 [Test build #29154 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29154/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85931920 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85928788 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85928723 [Test build #29153 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29153/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85935456 @mengxr Thank you for your help with the Java unit tests. As you may have guessed, I'm new to both Scala and Java and I was drowning in it.

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-86128922 LGTM. Merged into master. Thanks for contributing!

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-25 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4504 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85367964 [Test build #29059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29059/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-24 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85391856 [Test build #29059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/29059/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-85391870 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-84962071 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-84962033 [Test build #28990 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28990/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-23 Thread aborsu985
Github user aborsu985 commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26925182 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaTokenizerSuite.java --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-23 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-84934562 [Test build #28990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28990/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868590 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868579 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868583 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868564 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,67 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868574 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868587 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,103 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868562 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,67 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26868568 --- Diff: mllib/src/test/java/org/apache/spark/ml/feature/JavaTokenizerSuite.java --- @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-20 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-84106511 I don't know of a formatter that can do everything correctly. I use IntelliJ with the default Scala code style (except indent 2). I need to manually adjust the

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83475464 [Test build #28862 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28862/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83475548 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-19 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83425231 Sorry, my commit was a bit hasty. Are there any automated style checkers you would recommend?

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-19 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83425118 [Test build #28862 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28862/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82969443 Thank you for the tip. I'll look into the Java tests next week when I have some time. In the meantime, I changed the RegexTokenizer to extend from Tokenizer

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82968292 [Test build #28798 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28798/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665211 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,67 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665219 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665229 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665224 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665203 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,67 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665234 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665227 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665220 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83004046 [Test build #28798 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28798/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83004074 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r26665217 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/TokenizerSuite.scala --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-18 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-83022231 @aborsu985 Please check the code style and make sure you follow https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82405249 [Test build #28727 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28727/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82410900 @mengxr I do not think that LowerCase warrants a transformer; rather, it could be incorporated into a larger string-to-vector transformer that changes a text into a

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82421973 [Test build #28728 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28728/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82541096 You can use https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala as a template for unit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82465490 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82465423 [Test build #28727 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28727/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82481855 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-17 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-82481809 [Test build #28728 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28728/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-12 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-78750052 add to whitelist

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-78777321 Test PASSed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-78777298 [Test build #28544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28544/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-78750861 [Test build #28544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28544/consoleFull) for PR 4504 at commit

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-12 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-78709488 @aborsu985 Sorry for the delay! At a high level, I'm a little concerned about exposing too many parameters in the first version. NLTK's regex tokenizer

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-02 Thread aborsu985
Github user aborsu985 commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-76861500 Changed the minimum token length to 1 and removed the excluded bit. Added a matching param that allows switching from a matching regex to a splitting regex. Reduced
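
To illustrate the behaviour described in the comment above, here is a minimal sketch in plain Java (`java.util.regex` only, no Spark). It is not the code from the pull request: the class name `RegexTokenizerSketch`, the `tokenize` helper, and its parameter names are hypothetical, chosen only to mirror the "matching" switch and the minimum token length mentioned in the comment. When `matching` is true, the regex describes the tokens themselves; when false, the regex describes the separators and the text is split on them. Tokens shorter than `minTokenLength` are dropped in both modes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the RegexTokenizer from the pull request.
class RegexTokenizerSketch {

    // matching == true  : every regex match is a candidate token.
    // matching == false : the regex matches the gaps, so the text is split on it.
    // Candidates shorter than minTokenLength are discarded in both modes.
    static List<String> tokenize(String text, String regex,
                                 boolean matching, int minTokenLength) {
        Pattern pattern = Pattern.compile(regex);
        List<String> candidates = new ArrayList<>();
        if (matching) {
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                candidates.add(m.group());
            }
        } else {
            for (String piece : pattern.split(text)) {
                candidates.add(piece);
            }
        }
        List<String> tokens = new ArrayList<>();
        for (String candidate : candidates) {
            if (candidate.length() >= minTokenLength) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Te,st.  punct";
        // Matching mode: tokens are runs of word characters.
        System.out.println(tokenize(text, "\\w+", true, 1));  // [Te, st, punct]
        // Splitting mode: tokens are whatever lies between whitespace runs.
        System.out.println(tokenize(text, "\\s+", false, 1)); // [Te,st., punct]
    }
}
```

With a minimum token length of 1, the length filter only removes empty strings, which `Pattern.split` can produce when the text starts with a separator. Raising the minimum to 3 in matching mode would leave only `punct` for the sample text.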

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-03-02 Thread aborsu985
Github user aborsu985 commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25652664 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232685 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232690 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232686 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232692 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232703 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232681 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232679 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232691 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232693 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232697 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232695 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232699 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/4504#discussion_r25232683 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala --- @@ -39,3 +39,66 @@ class Tokenizer extends UnaryTransformer[String,

[GitHub] spark pull request: [ML][FEATURE] SPARK-5566: RegEx Tokenizer

2015-02-23 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/4504#issuecomment-75710113 @aborsu985 I made a pass on the code. Besides my inline comments, please add a unit test. It would be better if you could also add a Java unit test. Thanks!