Found another bug in HTMLParser. If the input is the HTML "CANON<br/>Model", CANONMODEL 
will be indexed instead of the two words CANON and MODEL.
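
For illustration only, here is a minimal sketch of one way this symptom can arise. It is an assumption, not a reading of the Jackrabbit source: if tags are deleted outright, the text on both sides of <br/> is fused into one token, whereas replacing each tag with a space keeps the two words apart. The class TagBoundarySketch and both method names are hypothetical, and case folding is left out.

```java
class TagBoundarySketch {

    // Deletes every tag, so text on either side of a tag is fused together.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    // Replaces every tag with a space so a tag acts as a word boundary,
    // then collapses runs of whitespace into a single space.
    static String tagsToSpaces(String html) {
        return html.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String html = "CANON<br/>Model";
        System.out.println(stripTags(html));    // prints "CANONModel"
        System.out.println(tagsToSpaces(html)); // prints "CANON Model"
    }
}
```

The regex is only good enough for this tiny input; a real extractor would have to handle comments, scripts and attributes containing ">".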

Thanks,
Kevin



----- Original Message ----
From: Cheng Zhang <[email protected]>
To: [email protected]
Sent: Saturday, January 3, 2009 5:08:44 PM
Subject: Re: search results

It turns out that org.apache.jackrabbit.extractor.HTMLParser eats all 
digits: in the method filterAndJoin, all non-letters are removed. 
Does anybody have any idea why we do so? IMO, indexing "hf100" makes more sense 
than indexing "hf". Or is there any way I can configure it to use my own HTMLParser 
instead of the default?
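
To make the difference concrete, here is a minimal sketch. It is not the actual filterAndJoin code, and the class and method names are hypothetical. It compares a filter that keeps only letters with one that also keeps digits, using the standard Character.isLetter and Character.isLetterOrDigit checks.

```java
class TokenFilterSketch {

    // Keeps only letters, mirroring the behaviour described above:
    // every digit in the token is dropped.
    static String lettersOnly(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Keeps letters and digits, so model numbers survive intact.
    static String lettersAndDigits(String token) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lettersOnly("hf100"));      // prints "hf"
        System.out.println(lettersAndDigits("hf100")); // prints "hf100"
    }
}
```

With the letters-only filter, a search for 'hf100' has nothing to match, because only "hf" would have been indexed.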

best,
kevin





----- Original Message ----
From: Cheng Zhang <[email protected]>
To: [email protected]
Sent: Saturday, January 3, 2009 3:02:51 PM
Subject: search results

Hi, 

I have an HTML file, shown below, stored in the repository.


<html><body>Manufacture: CANON<br/>
Model: HF100<br/>
Title: Canon VIXIA hf100 Flash Memory High Definition Camcorder with 12x 
Optical Image Stabilized Zoom<br/>
</body></html>

However, if I search for 'hf100', it returns nothing.

Any suggestions?

Thanks a lot,
Kevin
