Encoding problem with ExtractRequestHandler for HTML indexing

Ukyo Virgden Sun, 21 Mar 2010 09:46:17 -0700

Hi,

I'm trying to index HTML documents with different encodings. My HTML files
are in win-12XX, ISO-8859-X or UTF-8 encoding. The handler correctly parses
each HTML file in its own encoding and indexes it. However, on the web
interface I'm developing I enter query terms in UTF-8, which naturally do
not match content in the other encodings. Also, the results I see in my web
app are not UTF-8 encoded as expected.


My question: is there any filter I can use to convert all content extracted
by the handler to UTF-8 prior to indexing?

Does it make sense to write a filter that would convert tokens to UTF-8, or
is that even possible with multiple encodings?
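
To make the idea concrete, here is a minimal sketch of the kind of
conversion I have in mind, done on the raw bytes before they reach the
indexer. It assumes the source encoding of each document is already known
(for example from its meta tag); to_utf8 is a hypothetical helper name and
not part of the handler or any library.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8.

    source_encoding is assumed to be known per document; detecting it is
    outside the scope of this sketch.
    """
    text = raw.decode(source_encoding)  # bytes -> Unicode text
    return text.encode("utf-8")         # Unicode text -> UTF-8 bytes


# The same word stored in two different legacy encodings
word = "\u015feker"  # "seker" with an s-cedilla
as_windows = word.encode("windows-1254")
as_iso = word.encode("iso-8859-9")

# After conversion both byte strings are identical UTF-8
assert to_utf8(as_windows, "windows-1254") == to_utf8(as_iso, "iso-8859-9")
```

Once the bytes have been decoded to Unicode text, the encoding the document
originally used no longer matters, so a UTF-8 query term would compare equal
to the same word from any of the source encodings.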

Thanks in advance.
Ukyo
