subject:"Encoding problem with ExtractRequestHandler for HTML indexing"

Re: Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-24 Thread Teruhiko Kurosaka

I suppose you mean Extract_ing_RequestHandler. Out of curiosity, I sent in a Japanese HTML file in EUC-JP encoding; it was converted to Unicode properly, and the index has the correct Japanese words. Do your HTML files have a META tag for Content-Type whose value includes charset= ? For example,
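As a way to answer the META tag question for a batch of files, here is a minimal Python sketch (not part of the original message) that reports the charset an HTML file declares. The class and function names `MetaCharsetFinder` and `declared_charset` are hypothetical, and the sample markup is only an assumed illustration of a Content-Type META tag; it also assumes an ASCII-compatible encoding so the tag itself is readable before the real charset is known.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records the first charset declared in a META tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if "charset" in attr:
            # Short form: <meta charset="...">
            self.charset = attr["charset"].strip() or None
        elif attr.get("http-equiv", "").lower() == "content-type":
            # Long form: content="text/html; charset=..."
            for part in attr.get("content", "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip()


def declared_charset(raw: bytes):
    """Return the charset named in a META tag, or None if none is declared."""
    finder = MetaCharsetFinder()
    # Latin-1 maps every byte to a character, so decoding never fails;
    # only the ASCII-range markup near the top of the file matters here.
    finder.feed(raw[:4096].decode("latin-1"))
    return finder.charset


# Assumed sample document, for illustration only
sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=EUC-JP"></head><body></body></html>'
)
print(declared_charset(sample))
```

A file for which this returns `None` leaves the parser to guess the encoding, which is one place a mismatch could creep in.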

Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-21 Thread Ukyo Virgden

Hi, I'm trying to index HTML documents with different encodings. My HTML files are in either win-12XX, ISO-8859-X, or UTF-8 encoding. The handler correctly parses all HTML in their respective encodings and indexes them. However, on the web interface I'm developing, I enter query terms in UTF-8, which naturally does
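The two sides of this setup can be sketched in Python as follows. This is an illustration only, not taken from the original message: the term `café` and the choice of `cp1252` as the legacy encoding are assumptions. It shows that the same word has different bytes under different encodings, that decoding each with its own charset yields identical Unicode text, and that a query term percent-encodes differently depending on the encoding the client uses.

```python
from urllib.parse import urlencode

term = "café"  # assumed example term

# The same word stored under two encodings yields different bytes...
as_cp1252 = term.encode("cp1252")
as_utf8 = term.encode("utf-8")

# ...but decoding each with its own charset recovers identical Unicode text.
same_text = as_cp1252.decode("cp1252") == as_utf8.decode("utf-8")

# urlencode percent-encodes str values as UTF-8 unless told otherwise.
utf8_query = urlencode({"q": term})
legacy_query = urlencode({"q": term}, encoding="cp1252")

print(as_cp1252, as_utf8, same_text)
print(utf8_query, legacy_query)
```

If the server decodes the query string with a different charset than the client used to encode it, the term it searches for will not be the one that was typed.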