[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-05 Thread Levente Torok (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16074864#comment-16074864
 ] 

Levente Torok edited comment on SPARK-11069 at 7/5/17 2:39 PM:
---

The PySpark interface doesn't have this function implemented, which is why I 
was misled. Sorry. What can I do?


was (Author: levente.torok.ge):
The PySpark interface doesn't have this function implemented, which is why I 
was misled.

> Add RegexTokenizer option to convert to lowercase
> -------------------------------------------------
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: yuhao yang
> Priority: Minor
> Fix For: 1.6.0
>
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.
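
The ordering question in the proposal can be illustrated with a small sketch in plain Scala (no Spark dependency). The object and method names below are hypothetical, and the splitting logic only approximates a gaps-style tokenizer; it is not RegexTokenizer's actual implementation. It shows how a case-sensitive pattern can give different tokens depending on whether lowercasing happens before or after matching:

```scala
import java.util.Locale

// Hypothetical helper, for illustration only.
object LowercaseOrder {
  // Treat every match of the pattern as a delimiter and drop empty tokens.
  private def split(text: String, pattern: String): Seq[String] =
    pattern.r.split(text).toSeq.filter(_.nonEmpty)

  // Option "Before": lowercase the whole input, then apply the regex.
  def lowercaseBefore(text: String, pattern: String): Seq[String] =
    split(text.toLowerCase(Locale.ROOT), pattern)

  // Option "After": apply the regex to the original text, then lowercase each token.
  def lowercaseAfter(text: String, pattern: String): Seq[String] =
    split(text, pattern).map(_.toLowerCase(Locale.ROOT))
}
```

With the pattern `[A-Z]` and the input `fooXbarYbaz`, `lowercaseBefore` yields the single token `fooxbarybaz` because no uppercase delimiters remain, while `lowercaseAfter` yields `foo`, `bar`, `baz`.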



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073987#comment-16073987
 ] 

yuhao yang edited comment on SPARK-11069 at 7/4/17 6:32 PM:


   [~levente.torok.ge] toLowercase is set to true by default from 1.6 onward, to 
be consistent with Tokenizer and to suit common use cases. The change of 
behavior is documented in the 1.6 release notes: 
https://spark.apache.org/releases/spark-release-1-6-0.html

You can disable it by setting toLowercase to false:
 val regexTokenizer = new RegexTokenizer()
   .setToLowercase(false)
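
A fuller sketch of the same setting in context is given here. It assumes Spark 2.0 or later for `SparkSession` (on 1.6 a `SQLContext` would play that role); the object name, the helper `tokenize`, and the column names `text` and `words` are illustrative choices, not part of the original comment:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

// Hypothetical example object; names are for illustration only.
object CaseSensitiveTokenizerExample {
  // Tokenize a single string on whitespace while preserving its original case.
  def tokenize(spark: SparkSession, text: String): Seq[String] = {
    import spark.implicits._
    val df = Seq(text).toDF("text")

    val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setPattern("\\s+")      // delimiter pattern; this is also the default
      .setToLowercase(false)   // keep the original case

    regexTokenizer.transform(df).select("words").head().getSeq[String](0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("regex-tokenizer-case")
      .getOrCreate()
    try println(tokenize(spark, "Hello Spark ML").mkString(", "))
    finally spark.stop()
  }
}
```

With `setToLowercase(false)` the tokens should keep their case (`Hello`, `Spark`, `ML`); leaving the parameter at its default of true would lowercase them.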



was (Author: yuhaoyan):
   [~levente.torok.ge] toLowercase is set to true by default from 1.6 onward, to 
be consistent with Tokenizer and to suit common use cases. The change of 
behavior is documented in the 1.6 release notes: 
https://spark.apache.org/releases/spark-release-1-6-0.html

 val regexTokenizer = new RegexTokenizer()
   .setToLowercase(false)








[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073987#comment-16073987
 ] 

yuhao yang edited comment on SPARK-11069 at 7/4/17 6:31 PM:


   [~levente.torok.ge] toLowercase is set to true by default from 1.6 onward, to 
be consistent with Tokenizer and to suit common use cases. The change of 
behavior is documented in the 1.6 release notes: 
https://spark.apache.org/releases/spark-release-1-6-0.html

 val regexTokenizer = new RegexTokenizer()
   .setToLowercase(false)



was (Author: yuhaoyan):
   [~levente.torok.ge] use

 val regexTokenizer = new RegexTokenizer()
   .setToLowercase(false)








[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2017-07-04 Thread Levente Torok (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073508#comment-16073508
 ] 

Levente Torok edited comment on SPARK-11069 at 7/4/17 11:36 AM:


With this modification, in v1.6.x, there is no way to tokenize without 
conversion. So this modification sucks.

So if the "toLowercase" option is not implemented, as is the case now, it is 
still better to have no conversion at all: one can convert to lowercase before 
tokenizing if desired, but one cannot opt out of the conversion if it is not 
wanted.




was (Author: levente.torok.ge):
With this modification, in v1.6.x, there is no way to tokenize w/o. So this 
modification sucks.

So if the "toLowercase" option is not implemented, as is the case now, it is 
still better to have no conversion at all: one can convert to lowercase before 
tokenizing if desired, but one cannot opt out of the conversion if it is not 
wanted.









[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2015-10-12 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954185#comment-14954185
 ] 

yuhao yang edited comment on SPARK-11069 at 10/13/15 5:11 AM:
--

I'll try to do it and test with several cases. I'll post updates here if 
anything unexpected comes up. Incidentally, sklearn also converts to lowercase 
before tokenizing: 
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html


was (Author: yuhaoyan):
I'll try to do it and test with several cases. I'll post updates here if 
anything unexpected comes up. Incidentally, sklearn also converts to lowercase 
before tokenizing.

> Add RegexTokenizer option to convert to lowercase
> -------------------------------------------------
>
> Key: SPARK-11069
> URL: https://issues.apache.org/jira/browse/SPARK-11069
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer 
> does not.  It would be nice to add an option to RegexTokenizer to convert to 
> lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat 
> upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, 
> they can convert to lowercase manually.






[jira] [Comment Edited] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

2015-10-12 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954185#comment-14954185
 ] 

yuhao yang edited comment on SPARK-11069 at 10/13/15 5:11 AM:
--

I'll try to do it and test with several cases. I'll post updates here if 
anything unexpected comes up. Incidentally, sklearn also converts to lowercase 
before tokenizing.


was (Author: yuhaoyan):
I'll try to do it and test with several cases. I'll post updates here if 
anything unexpected comes up.



