Hi,

On 10/15/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Jukka Zitting wrote:
> > On 10/15/07, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:
> >> Add encode detection support for HTML parser
> >
> > This feature sounds like something that the upstream HTML parser
> > library might want to do. I'm not sure if NekoHTML is maintained
> > anywhere, but if it was we should probably consider sending a patch
> > for that.
>
> Well I used the same piece of code that you used for txtparser so
> detection/decoding is provided by icu4j and not Tika. So I was
> basically just using icu for getting properly decoded reader nothing
> fancier than that.

Agreed, it's a relatively simple and straightforward enhancement, but
wouldn't it also be useful for other users of NekoHTML?

Also, how about handling
<meta http-equiv='Content-Type' content='text/html;charset=...'> tags or
<?xml version="1.0" encoding="..."?> prefixes? IMHO concerns like that
are a slippery slope that we should avoid getting involved with within
Tika. It's best if all such knowledge is embedded in the external parser
libraries we use.

BR,

Jukka Zitting
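
P.S. To make the "slippery slope" concrete, here is a rough sketch of what sniffing a declared charset from the first bytes of a document might involve. This is only an illustration: the class name DeclaredCharsetSniffer and the method sniff() are hypothetical and not part of Tika, NekoHTML, or icu4j, and the regular expressions are a simplification that ignores byte order marks, comments, and malformed markup.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class DeclaredCharsetSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration
    // at the very start of the input.
    private static final Pattern XML_DECL = Pattern.compile(
            "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches "charset=..." anywhere inside a <meta ...> tag.
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in the given leading bytes, if any.
    static Optional<String> sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup
        // survives regardless of the real encoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);

        Matcher m = XML_DECL.matcher(text);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        m = META.matcher(text);
        if (m.find()) {
            return Optional.of(m.group(1));
        }
        return Optional.empty();
    }
}
```

Even this small sketch has to decide which declaration wins when both are present, and it says nothing about what to do when the declared charset disagrees with what a statistical detector reports. That is the kind of knowledge I'd rather see live in the parser library.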
