[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2017-03-26 Thread Michael Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942446#comment-15942446
 ] 

Michael Braun commented on LUCENE-7525:
---

Should a new utility Char-To-Int map be created in lucene-core utils to store 
the char->int (combined chars) map?

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, 
> LUCENE-7525.patch, LUCENE-7525.patch, TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2017-01-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842914#comment-15842914
 ] 

Adrien Grand commented on LUCENE-7525:
--

+1 [~steve_rowe]

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, 
> LUCENE-7525.patch, LUCENE-7525.patch, TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2017-01-27 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15842901#comment-15842901
 ] 

Steve Rowe commented on LUCENE-7525:


{quote}
I think, we can for now replace the large switch statement with a resource 
file. I'd have 2 ideas:
# A UTF-8 encoded file with 2 columns: first column is a single char, 2nd 
column is a series of replacements. I don't really like this approach as it is 
very sensitive to corrumption by editors and hard to commit correct
# A simple file like int => int,int,int // comment, this is easy to parse and 
convert, but backside is that its harder to read the codepoints (for that we 
have a comment)
{quote}

I wrote a Perl script to create {{mapping-FoldToASCII.txt}}, which is usable 
with {{MappingCharFilter}}, from the {{ASCIIFoldingFilter}} code - the script 
is actually embedded in that file, which is included in several of Solr's 
example configsets, e.g. under 
{{solr/server/solr/configsets/sample_techproducts_configs/conf/}}.  Maybe this 
file could be used directly?  It's human friendly, so would allow for easy user 
customization.

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, 
> LUCENE-7525.patch, TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2017-01-19 Thread Michael Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830662#comment-15830662
 ] 

Michael Braun commented on LUCENE-7525:
---

We are still experiencing this - is there someone actively working on getting a 
solution committed? 

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFoldingFilter.java, ASCIIFolding.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-30 Thread Karl von Randow (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15620484#comment-15620484
 ] 

Karl von Randow commented on LUCENE-7525:
-

It seemed to me that we could try to generalise the FST creation from 
{{MappingCharFilter}} and use it in both places; therefore retaining the 
existing differences between the two classes. I haven't tried that route as the 
switch-split was simple, minimal, and addressed the issue :-)

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618179#comment-15618179
 ] 

Uwe Schindler commented on LUCENE-7525:
---

bq. I thought the ASCII Filter was basically a legacy filter and it grew into 
MappingCharFilter

I generally prefer ASCIIFoldingFilter because it allows you much more 
flexibility. The problem with MappingCharFilter is that you can only apply it 
*before* tokenization. And this is the major downside: If you have tokenizers 
that rely on language specific stuff (Asian like Japanese, Chinese,...) it is a 
bad idea to do such stuff like before the tokenization. Also if you do 
stemming: Stemming in France breaks if you remove accents before! So 
ASCIIFoldingFilter is in most analysis chains the very last filter!

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Alexandre Rafalovitch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618059#comment-15618059
 ] 

Alexandre Rafalovitch commented on LUCENE-7525:
---

I thought the ASCII Filter was basically a legacy filter and it grew into 
MappingCharFilterFactory. Now, we are talking about making ASCII Filter even 
more like Mapping Char Filter. Why would we need both then? I know one is Token 
Filter and another Char filter, but I am not sure it is sufficiently important 
for this discussion.


> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618031#comment-15618031
 ] 

Uwe Schindler commented on LUCENE-7525:
---

I think, we can for now replace the large switch statement with a resource 
file. I'd have 2 ideas:
- A UTF-8 encoded file with 2 columns: first column is a single char, 2nd 
column is a series of replacements. I don't really like this approach as it is 
very sensitive and hard to commit
- A simple file like {{int => int,int,int // comment}}, this is easy to parse 
and convert, but backside is that its harder to read the codepoints (for that 
we have a comment)

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618025#comment-15618025
 ] 

Uwe Schindler commented on LUCENE-7525:
---

I am not sure if an FST is not like to use a sledgehammer to crack a nut :-) We 
just need a lookup for single ints in a for-loop and replace those ints with a 
sequence of other ints.

I will check the ICU source code to se what they are doing.

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15618018#comment-15618018
 ] 

Michael McCandless commented on LUCENE-7525:


Maybe an FST, like {{MappingCharFilter}}?

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-29 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15617679#comment-15617679
 ] 

Uwe Schindler commented on LUCENE-7525:
---

I'd suggest to use the simple binary search approach, but without generated 
code. I'd suggest to convert the large switch statement once to a simple text 
file and load it as resource in static initializer.

This allows to maybe further extend the folding filter so people can use their 
own mappings by pointing to an input stream!

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-28 Thread Ahmet Arslan (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15616922#comment-15616922
 ] 

Ahmet Arslan commented on LUCENE-7525:
--

Can workings of ICUFoldingFilter give any insight here?

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
> Attachments: ASCIIFolding.java, ASCIIFoldingFilter.java, 
> TestASCIIFolding.java
>
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-27 Thread Karl von Randow (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613746#comment-15613746
 ] 

Karl von Randow commented on LUCENE-7525:
-

I have created a version that uses static arrays and a binary search. This 
performs similarly to splitting the existing code into two methods; splitting 
the switch cases between the methods, and delegating to the second method from 
the first's {{default:}} block. Not surprising, as the switch of that size (and 
breadth of values) is a binary search, but native.

I'll try to attach the code examples of the two approaches. These two 
approaches are perhaps a simpler change than going to a new lookup structure, 
and I think will be more performant (as they're a binary search over an array)?

The simplest change is the switch-split. It will also be the most attractive in 
the Git history (as the method is literally split in two about half-way through 
the case statements). It will also be the easiest to prove that the behaviour 
is the same as before.

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-27 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612234#comment-15612234
 ] 

Uwe Schindler commented on LUCENE-7525:
---

I'd make this huge switch statement a different lookup structure, e.g., a 
HashMap or similar (possibly/maybe with Java9 Lambdas as values).

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-27 Thread Michael Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612160#comment-15612160
 ] 

Michael Braun commented on LUCENE-7525:
---

Was just profiling indexing yesterday on our side and noticed the exact same 
thing -  would love to see the performance of this method improved! 

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7525) ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method size

2016-10-27 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612135#comment-15612135
 ] 

Adrien Grand commented on LUCENE-7525:
--

This is probably one of the most used token filters, so +1 to explore if/how we 
can make it faster. Maybe we should add an analyzer that has case folding 
enabled to http://people.apache.org/~mikemccand/lucenebench/analyzers.html.

> ASCIIFoldingFilter.foldToASCII performance issue due to large compiled method 
> size
> --
>
> Key: LUCENE-7525
> URL: https://issues.apache.org/jira/browse/LUCENE-7525
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Affects Versions: 6.2.1
>Reporter: Karl von Randow
>
> The {{ASCIIFoldingFilter.foldToASCII}} method has an enormous switch 
> statement and is too large for the HotSpot compiler to compile; causing a 
> performance problem.
> The method is about 13K compiled, versus the 8KB HotSpot limit. So splitting 
> the method in half works around the problem.
> In my tests splitting the method in half resulted in a 5X performance 
> increase.
> In the test code below you can see how slow the fold method is, even when it 
> is using the shortcut when the character is less than 0x80, compared to an 
> inline implementation of the same shortcut.
> So a workaround is to split the method. I'm happy to provide a patch. It's a 
> hack, of course. Perhaps using the {{MappingCharFilterFactory}} with an input 
> file as per SOLR-2013 would be a better replacement for this method in this 
> class?
> {code:java}
> public class ASCIIFoldingFilterPerformanceTest {
>   private static final int ITERATIONS = 1_000_000;
>   @Test
>   public void testFoldShortString() {
>   char[] input = "testing".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testFoldShortAccentedString() {
>   char[] input = "éúéúøßüäéúéúøßüä".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   ASCIIFoldingFilter.foldToASCII(input, 0, output, 0, 
> input.length);
>   }
>   }
>   @Test
>   public void testManualFoldTinyString() {
>   char[] input = "t".toCharArray();
>   char[] output = new char[input.length * 4];
>   for (int i = 0; i < ITERATIONS; i++) {
>   int k = 0;
>   for (int j = 0; j < 1; ++j) {
>   final char c = input[j];
>   if (c < '\u0080') {
>   output[k++] = c;
>   } else {
>   Assert.assertTrue(false);
>   }
>   }
>   }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org