Re: lucene highlights wrong word

solprovider Tue, 28 Feb 2006 14:07:17 -0800

On 2/28/06, Lars Geldner <[EMAIL PROTECTED]> wrote:
> I'm using Lenya 1.2.4 with Lucene that has been installed with this version
> of Lenya.
> My Lenya publication is multilingual, i.e. English, German and Portuguese.
> There are no problems when searching the English and the Portuguese variant
> of the publication.
> But when I try to search the German publication the wrong word is
> highlighted at the result page. It seems that the displacement of the
> highlighted word increases with the length of the text.
>
> P.S.: The displacement is still the same when using the XSLT/XMAP/XSP files
> from http://solprovider.com/lenya/&cat=Search.
> It seems like this is not a problem of Lucene but in the file
> search-and-results.xsp of Lenya. The problem only occurs if the text to
> search contains the German character ß.
> If, for example, the text contains two ß before the word to search appears
> then the highlighted text in the result page is shifted by two characters to
> the right.
>
> Is there a solution to this problem?


I think the problem is that ß has no single-character uppercase version.
I wrote the following code to be case-insensitive, but it assumes the
length of a word is the same as the length of the word after
uppercasing. As far as I know, java.lang.String.toUpperCase() expands ß
to "SS" rather than dropping it, so every ß before the match moves the
index found in the uppercased text one position to the right of the
original text. That matches the shift you describe, and we can
(attempt to) work around it.
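
To check the length assumption directly, here is a minimal standalone sketch. It is not part of search-and-results.xsp, and the class and method names are made up for illustration. It compares where a word is found in the original text with where it is found after uppercasing both, using Locale.ROOT so the result does not depend on the server's default locale.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lenya.
class UpperCaseShiftDemo {
    // How far the match index in the uppercased text is displaced
    // from the match index in the original text.
    static int shift(String text, String word) {
        int inOriginal = text.indexOf(word);
        int inUpper = text.toUpperCase(Locale.ROOT)
                          .indexOf(word.toUpperCase(Locale.ROOT));
        return inUpper - inOriginal;
    }

    public static void main(String[] args) {
        // \u00df is the German sharp s; escaped to avoid source-encoding issues.
        String text = "Die Stra\u00dfe und der Fu\u00df vor dem Haus";
        String upper = text.toUpperCase(Locale.ROOT);
        System.out.println(text.length() + " chars before, "
                + upper.length() + " chars after toUpperCase()");
        // Two sharp s characters precede "Haus", so the shift is 2.
        System.out.println("shift = " + shift(text, "Haus"));
    }
}
```

If the shift printed here equals the number of ß characters before the keyword, the highlighting offset comes from the uppercased copy being longer than the original.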

Try changing the comparisons to use lowercase.

FILE: search-and-results.xsp
                       if(lfield.name().equals("htmlbody")){
                          String tmphtmlbody = slfield;
//                          String upperhtmlbody = tmphtmlbody.toUpperCase();
                          String lowerhtmlbody = tmphtmlbody.toLowerCase();
                          if(twords != null){
                             Enumeration twordsE = twords.elements();
                             while(twordsE.hasMoreElements()){
                                int last = 0;
                                String word = twordsE.nextElement().toString();
//                                String upperword = word.toUpperCase();
                                String lowerword = word.toLowerCase();
                                int wordLen = word.length();
                                StringBuffer sb = new StringBuffer();
//                                int current = upperhtmlbody.indexOf(upperword);
                                int current = lowerhtmlbody.indexOf(lowerword);
                                if((current &lt; first) || (first == -1)) first = current;
                                while(current &gt; last){
                                   sb.append(tmphtmlbody.substring(last, current));
                                   sb.append("~").append(tmphtmlbody.substring(current, current + wordLen)).append("~");
                                   last = current + wordLen;
//                                   current = upperhtmlbody.indexOf(upperword, last);
                                   current = lowerhtmlbody.indexOf(lowerword, last);
                                }
                          }

Going forward:
How many lowercase characters do not have a single-character uppercase
version?  How many uppercase characters do not have a single-character
lowercase version?  Can we confirm exactly what Java's
String.toUpperCase() does with such characters, and whether
String.toLowerCase() can change the length of a string in the same
way?  If the length really does change, is that intended behavior in a
core class, or a bug nobody has fixed?
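
These questions can be answered empirically. The sketch below is only an illustration with made-up class and method names. It walks the Basic Multilingual Plane and collects every character whose one-character string changes length when converted with toUpperCase() or toLowerCase(). The exact lists depend on the Unicode tables shipped with the JVM, so no counts are claimed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical survey class, not part of Lenya.
class CaseLengthSurvey {
    // Characters whose case conversion does not yield exactly one char.
    // upper == true checks toUpperCase(), otherwise toLowerCase().
    static List<Character> changesLength(boolean upper) {
        List<Character> result = new ArrayList<Character>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            if (Character.isSurrogate(c)) continue; // skip unpaired surrogates
            String s = String.valueOf(c);
            String converted = upper ? s.toUpperCase(Locale.ROOT)
                                     : s.toLowerCase(Locale.ROOT);
            if (converted.length() != 1) result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("toUpperCase changes length for "
                + changesLength(true).size() + " characters");
        System.out.println("toLowerCase changes length for "
                + changesLength(false).size() + " characters");
    }
}
```

If the second list is not empty on your JVM, the lowercase workaround has the same weakness for those characters, only for text that is much less common in German content.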

There is no String.indexOfIgnoreCase().  A full solution should verify
that the lengths do not change with toUpperCase(), fall back to
toLowerCase() if they do, and as a last resort match the keywords
case-sensitively.  If the above code works, I will write a better solution
and update the code on my site.
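
One possible shape for that better solution, offered as a sketch rather than the final code: search the original text directly with String.regionMatches(true, ...), which compares character by character while ignoring case, so every index refers to the unmodified text. The names indexOfIgnoreCase and highlight are hypothetical helpers, and the "~" markers follow the XSP excerpt above.

```java
// Hypothetical helper class, not part of Lenya.
class HighlightSketch {
    // Case-insensitive indexOf that never builds a converted copy,
    // so the returned index is valid in the original text.
    static int indexOfIgnoreCase(String text, String word, int from) {
        int max = text.length() - word.length();
        for (int i = Math.max(from, 0); i <= max; i++) {
            if (text.regionMatches(true, i, word, 0, word.length())) {
                return i;
            }
        }
        return -1;
    }

    // Wraps every case-insensitive occurrence of word in "~" markers.
    static String highlight(String text, String word) {
        if (word.length() == 0) return text; // nothing to highlight
        StringBuilder sb = new StringBuilder();
        int last = 0;
        int current = indexOfIgnoreCase(text, word, 0);
        while (current >= 0) {
            sb.append(text, last, current);
            sb.append('~').append(text, current, current + word.length()).append('~');
            last = current + word.length();
            current = indexOfIgnoreCase(text, word, last);
        }
        sb.append(text.substring(last)); // remainder after the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Die Stra\u00dfe und der Fu\u00df vor dem Haus", "haus"));
    }
}
```

Because the comparison is per character, a keyword typed as "strasse" will not match "Straße" with this approach. Whether that matters depends on how the Lucene analyzer indexes the German text.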

solprovider
