[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2022-10-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17614621#comment-17614621
 ] 

Apache Spark commented on SPARK-9213:
-

User 'lyy-pineapple' has created a pull request for this issue:
https://github.com/apache/spark/pull/38171

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Reporter: Reynold Xin
> Priority: Major
> Labels: bulk-closed
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF8 encoded bytes into Java String, 
> and then run regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF8 
> encoded bytes. One prominent example is joni, used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
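To make the second point in the description concrete, here is a minimal sketch of the round trip it describes, using only java.util.regex. The object and method names are hypothetical and are not taken from Spark; the sketch only shows where the two conversions happen, which a byte-oriented engine such as joni is meant to avoid.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

object Utf8RoundTrip {
  // Hypothetical helper: the data is stored as UTF-8 bytes, but
  // java.util.regex works on CharSequence, so each call decodes the
  // input and re-encodes the result.
  def replaceAll(input: Array[Byte], pattern: Pattern, replacement: String): Array[Byte] = {
    val decoded = new String(input, UTF_8)                          // bytes -> String
    val replaced = pattern.matcher(decoded).replaceAll(replacement) // regex on the String
    replaced.getBytes(UTF_8)                                        // String -> bytes
  }
}
```

For example, `Utf8RoundTrip.replaceAll("a1b22".getBytes(UTF_8), Pattern.compile("[0-9]+"), "#")` yields the UTF-8 bytes of `a#b#`, at the cost of one decode and one encode per call.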



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2022-03-09 Thread tonydoen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504016#comment-17504016
 ] 

tonydoen commented on SPARK-9213:
-

[~rxin] [~waterman] [~mridulm80] 




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2017-08-29 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16145831#comment-16145831
 ] 

Mridul Muralidharan commented on SPARK-9213:


[~rxin] Curious what happened to this effort - did we find a replacement? Or 
is it still a TODO that would help?




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-12 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742070#comment-14742070
 ] 

Yadong Qi commented on SPARK-9213:
--

[~rxin] I'm working on this, and already have a pull request, as you have seen.




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740472#comment-14740472
 ] 

Apache Spark commented on SPARK-9213:
-

User 'watermen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8715




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-09-10 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739920#comment-14739920
 ] 

Reynold Xin commented on SPARK-9213:


[~waterman] are you still working on this? It is fine if you are not - there 
are other people asking me about whether they could work on this issue.




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-14 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696630#comment-14696630
 ] 

Yadong Qi commented on SPARK-9213:
--

I know, thanks!




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696596#comment-14696596
 ] 

Reynold Xin commented on SPARK-9213:


We just need to handle it in the analyzer to rewrite it, and also pattern match 
it in the optimizer. No need to handle it everywhere else, since the analyzer 
will take care of the rewrite.
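As an illustration of the "pattern match it in the optimizer" part of this comment, here is a minimal sketch in which a rule has to match both node types. The tiny expression tree, the `StartsWith` node, and the prefix simplification are assumptions chosen for the example; this is not Spark's actual optimizer code.

```scala
object OptimizerSketch {
  // Toy expression tree, a stand-in for Catalyst expressions.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: String) extends Expr
  case class Like(left: Expr, right: Expr) extends Expr
  case class LikeJavaFallback(left: Expr, right: Expr) extends Expr
  case class StartsWith(left: Expr, prefix: String) extends Expr

  // Extracts the prefix of a literal pattern such as "abc%" that has
  // no other wildcard or escape character.
  object PrefixPattern {
    def unapply(e: Expr): Option[String] = e match {
      case Lit(p) if p.endsWith("%") &&
          !p.dropRight(1).exists(c => c == '%' || c == '_' || c == '\\') =>
        Some(p.dropRight(1))
      case _ => None
    }
  }

  // The rule lists both variants, so it still fires after the analyzer
  // has swapped one node type for the other.
  def simplify(e: Expr): Expr = e match {
    case Like(l, PrefixPattern(prefix)) => StartsWith(l, prefix)
    case LikeJavaFallback(l, PrefixPattern(prefix)) => StartsWith(l, prefix)
    case other => other
  }
}
```

Only rules that inspect the node type need the extra case; code that treats the node as an opaque expression is unaffected.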





[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696595#comment-14696595
 ] 

Yadong Qi commented on SPARK-9213:
--

[~rxin] Like is used in many places. Is it OK to check the config option 
everywhere?




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696574#comment-14696574
 ] 

Reynold Xin commented on SPARK-9213:


I'm thinking we just have Like for Joni, and then LikeJavaFallback for Java. In 
the analyzer, we replace Like with LikeJavaFallback if the config option is set 
to java.
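The rewrite described in this comment could be sketched roughly as follows. The toy expression tree and the `engine` parameter are assumptions made for the example (the thread only mentions "a SQL config flag"); a real rule would work on Catalyst's own classes.

```scala
object AnalyzerSketch {
  // Toy expression tree, a stand-in for Catalyst expressions.
  sealed trait Expr
  case class Col(name: String) extends Expr
  case class Lit(value: String) extends Expr
  case class And(left: Expr, right: Expr) extends Expr
  case class Like(left: Expr, right: Expr) extends Expr             // Joni-backed
  case class LikeJavaFallback(left: Expr, right: Expr) extends Expr // java.util.regex-backed

  // Replaces every Like with LikeJavaFallback when the (hypothetical)
  // config value is "java"; otherwise the tree is returned unchanged.
  def rewrite(e: Expr, engine: String): Expr =
    if (engine != "java") e
    else e match {
      case Like(l, r) => LikeJavaFallback(rewrite(l, engine), rewrite(r, engine))
      case LikeJavaFallback(l, r) => LikeJavaFallback(rewrite(l, engine), rewrite(r, engine))
      case And(l, r) => And(rewrite(l, engine), rewrite(r, engine))
      case leaf => leaf
    }
}
```

Because the swap happens once in the analyzer, the config value is read in a single place instead of at every use of Like.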





[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696561#comment-14696561
 ] 

Yadong Qi commented on SPARK-9213:
--

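[~rxin] Yes, I will do it as below:
```
// Sketch only: Expression is Catalyst's expression type, and `flag` stands
// for the SQL config option that selects the Java engine.
object Like {
  // Pick the implementation based on the feature flag.
  def apply(left: Expression, right: Expression) =
    if (flag) {
      JavaLike(left, right)
    } else {
      JoniLike(left, right)
    }
}

case class JavaLike(left: Expression, right: Expression)
case class JoniLike(left: Expression, right: Expression)
```
Right?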




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696478#comment-14696478
 ] 

Reynold Xin commented on SPARK-9213:


[~waterman] 

I don't think we need to worry too much about code reuse here. In the long run, 
I think we only need one implementation.

However, in the short run, it'd be great to be able to feature flag the regular 
expression engine based on a SQL config flag. In order to do that, we can have 
two classes for each regex function.





[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696325#comment-14696325
 ] 

Yadong Qi commented on SPARK-9213:
--

1. Use Joni regex instead of Java regex: I only need to replace the Java 
functions.
2. Add Joni regex and keep Java regex: I need to define an abstract engine 
trait (JavaRegexEngine/JoniRegexEngine). Because the return types of their 
functions differ (for example, matcher() returns a Matcher in both, but one is 
java.util.regex.Matcher and the other is org.joni.Matcher), I need to rework 
some code.
I'll try 2 first, similar to the Java/Kryo choice for serializers.
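One way to picture option 2 is a trait that hides the engine-specific Matcher type from callers. The sketch below is an assumption about how that could look, not the code from the pull request; the trait and method names are hypothetical, and only the Java-backed engine is implemented so the example has no external dependency.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.regex.Pattern

object RegexEngineSketch {
  // Engine-neutral view of a compiled pattern; callers never see
  // java.util.regex.Matcher or org.joni.Matcher directly.
  trait CompiledRegex {
    def find(input: Array[Byte]): Boolean
  }

  trait RegexEngine {
    def compile(pattern: String): CompiledRegex
  }

  object JavaRegexEngine extends RegexEngine {
    def compile(pattern: String): CompiledRegex = {
      val compiled = Pattern.compile(pattern)
      new CompiledRegex {
        // Decodes the bytes first, because java.util.regex needs a CharSequence.
        def find(input: Array[Byte]): Boolean =
          compiled.matcher(new String(input, UTF_8)).find()
      }
    }
  }

  // A JoniRegexEngine would implement the same trait on top of org.joni
  // and search the byte array directly; it is left out of this sketch.

  // Hypothetical selector keyed by a config value.
  def engineFor(name: String): RegexEngine = name match {
    case "java" => JavaRegexEngine
    case other => throw new IllegalArgumentException(s"unsupported regex engine: $other")
  }
}
```

With this shape, the differing return types of matcher() stay inside each engine, and the expression code depends only on `CompiledRegex`.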




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694952#comment-14694952
 ] 

Reynold Xin commented on SPARK-9213:


Are there any semantic differences between the two?







[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-13 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694928#comment-14694928
 ] 

Yadong Qi commented on SPARK-9213:
--

[~rxin] Should we use Joni regex instead of Java regex, or add Joni regex and 
keep Java regex?




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-11 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682168#comment-14682168
 ] 

Reynold Xin commented on SPARK-9213:


Thanks!





[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-11 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681684#comment-14681684
 ] 

Yadong Qi commented on SPARK-9213:
--

Reynold Xin, I will work on this.




[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-07-21 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634652#comment-14634652
 ] 

Reynold Xin commented on SPARK-9213:


I think hbase/presto both use joni. We can look at their source code for 
inspiration and experience.

