Re: Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-24 Thread Teruhiko Kurosaka
I suppose you mean Extract_ing_RequestHandler.

Out of curiosity, I sent in a Japanese HTML file in EUC-JP encoding,
and it was converted to Unicode properly and the index has the correct
Japanese words.

Do your HTML files have a META tag for Content-Type whose value
includes charset= ? For example, this is what I have:
<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />
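
To illustrate why the declared charset matters, here is a rough Python sketch.
It is not Solr or Tika code, and the helper name decode_html and the regular
expression are my own assumptions for illustration. It only shows the general
idea: the bytes are decoded to Unicode using the charset named in the META tag,
so text that arrived in EUC-JP and a query that arrived in UTF-8 end up as the
same Unicode characters.

```python
import re

# Hypothetical helper, not part of Solr or Tika: look for a charset
# declaration (as found in a META Content-Type tag) near the top of the file.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw HTML bytes to a Unicode string using the declared charset."""
    match = META_CHARSET.search(raw[:1024])
    charset = match.group(1).decode("ascii") if match else default
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(charset, errors="replace")

meta = '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />'
page = meta + "<p>検索</p>"
euc_jp_bytes = page.encode("euc_jp")   # what the file looks like on disk

text = decode_html(euc_jp_bytes)       # Unicode string after decoding

# A query term submitted as UTF-8 decodes to the same Unicode characters,
# so it can match the text that was originally stored as EUC-JP.
query = "検索".encode("utf-8").decode("utf-8")
print(query in text)
```

If the charset declaration is missing or wrong, the fallback in this sketch
(UTF-8) may produce replacement characters, which would be one possible cause
of terms not matching.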


On Mar 21, 2010, at 9:45 AM, Ukyo Virgden wrote:

 Hi,
 
 I'm trying to index HTML documents with different encodings. My HTML files are
 either in win-12XX, ISO-8859-X or UTF-8 encoding. The handler correctly parses
 all the HTML in their respective encodings and indexes it. However, on the web
 interface I'm developing I enter query terms in UTF-8, which naturally do
 not match content in different encodings. Also, the results I see in
 my web app are not UTF-8 encoded as expected.
 
 My question: is there any filter I can use to convert all content extracted
 by the handler to UTF-8 prior to indexing?
 
 Does it make sense to write a filter which would convert tokens to UTF-8, or
 is that even possible with multiple encodings?
 
 Thanks in advance.
 Ukyo


Teruhiko Kuro Kurosaka
RLP + Lucene & Solr = powerful search for global contents



Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-21 Thread Ukyo Virgden
Hi,

I'm trying to index HTML documents with different encodings. My HTML files are
either in win-12XX, ISO-8859-X or UTF-8 encoding. The handler correctly parses
all the HTML in their respective encodings and indexes it. However, on the web
interface I'm developing I enter query terms in UTF-8, which naturally do
not match content in different encodings. Also, the results I see in
my web app are not UTF-8 encoded as expected.

My question: is there any filter I can use to convert all content extracted
by the handler to UTF-8 prior to indexing?

Does it make sense to write a filter which would convert tokens to UTF-8, or
is that even possible with multiple encodings?

Thanks in advance.
Ukyo