Re: WordDelimiterFilter splits at non-ASCII chars

2008-07-16 Thread Erik Hatcher
On Jul 16, 2008, at 4:33 AM, Stefan Oestreicher wrote: > Yes, you're right. I was testing with analysis.jsp but it chokes on multibyte chars. > I modified the jsp and set the encoding using request.setCharacterEncoding("UTF-8"); > and it's working fine. Bug in analysis.jsp? Yeah, it's recently been
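To illustrate why a missing request.setCharacterEncoding("UTF-8") call can matter, here is a minimal sketch using only the Java standard library. It assumes the form data arrived as UTF-8 bytes and was decoded as ISO-8859-1; that fallback charset is an assumption for illustration, not something the thread states. The class and method names (EncodingDemo, misdecode) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Encode the text as UTF-8, then decode the bytes as ISO-8859-1.
    // This mimics a server reading UTF-8 input with the wrong charset
    // (the ISO-8859-1 fallback is an assumption for this sketch).
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "h\u00e4lse"; // "hälse", written with an escape to avoid source-encoding issues
        String garbled = misdecode(word);
        // Print each char's code point so the result is unambiguous on any console.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X letter=%b upper=%b%n",
                    (int) c, Character.isLetter(c), Character.isUpperCase(c));
        }
    }
}
```

Under that assumption the single character "ä" becomes two characters, an uppercase letter (U+00C3) followed by a currency symbol (U+00A4), which would give any case-change or delimiter-based splitter two extra boundaries inside the word.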

RE: WordDelimiterFilter splits at non-ASCII chars

2008-07-16 Thread Stefan Oestreicher
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf > Of Yonik Seeley > Sent: Tuesday, July 15, 2008 6:29 PM > To: solr-user@lucene.apache.org > Subject: Re: WordDelimiterFilter splits at non-ASCII chars > > On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher >

Re: WordDelimiterFilter splits at non-ASCII chars

2008-07-15 Thread Yonik Seeley
On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher <[EMAIL PROTECTED]> wrote: > as I understand the WordDelimiterFilter should split on case changes, word > delimiters and changes from character to digit, but it should not > differentiate between ASCII and multibyte chars. It does however. The wo

Re: WordDelimiterFilter splits at non-ASCII chars

2008-07-15 Thread Shalin Shekhar Mangar
Hi Stefan, I wrote a test case for the problem you described but it is working fine. I used the following definition: What configuration are you using? If it is different, please share it so that I can test with it. On Tue, Jul 15, 2008 at 7:59 PM, Stefan Oestreicher < [EMAIL PROTECTED]> wrote

WordDelimiterFilter splits at non-ASCII chars

2008-07-15 Thread Stefan Oestreicher
Hi, as I understand the WordDelimiterFilter should split on case changes, word delimiters and changes from character to digit, but it should not differentiate between ASCII and multibyte chars. It does however. The word "hälse" (German plural of "neck") gets split into "h", "ä" and "lse", which un
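The splitting rules described above (case changes, word delimiters, letter/digit transitions) can be sketched in a few lines of plain Java. This is a simplified illustration, not Solr's WordDelimiterFilter implementation; the class and method names are hypothetical. It uses the Unicode-aware java.lang.Character methods, so a correctly decoded non-ASCII letter is treated like any other lowercase letter.

```java
import java.util.ArrayList;
import java.util.List;

class SimpleWordSplitter {
    // Split on non-alphanumeric characters, lower-to-upper case changes,
    // and letter/digit transitions. Simplified sketch, not Solr's code.
    static List<String> split(String input) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char prev = 0;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (!Character.isLetterOrDigit(c)) {
                flush(current, parts);          // delimiter: end the current part
            } else {
                if (current.length() > 0 && isBoundary(prev, c)) {
                    flush(current, parts);      // boundary inside a run of letters/digits
                }
                current.append(c);
            }
            prev = c;
        }
        flush(current, parts);
        return parts;
    }

    // Only called when both prev and c are letters or digits.
    static boolean isBoundary(char prev, char c) {
        boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
        boolean letterDigit = Character.isLetter(prev) != Character.isLetter(c);
        return caseChange || letterDigit;
    }

    static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }
}
```

With this sketch, split("h\u00e4lse") returns the single part "hälse", while split("PowerShot500") returns "Power", "Shot" and "500". If the input has already been garbled by a wrong charset before it reaches the splitter, for example "h\u00c3\u00a4lse", the result is three parts, which resembles the symptom reported in this thread.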