Re: Pushing functionality to upstream projects (Was: [jira] Resolved: (TIKA-65) Add encode detection support for HTML parser)

Jukka Zitting Mon, 15 Oct 2007 13:47:43 -0700

Hi,

On 10/15/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Jukka Zitting wrote:
> > On 10/15/07, Sami Siren (JIRA) <[EMAIL PROTECTED]> wrote:
> >> Add encode detection support for HTML parser
> >
> > This feature sounds like something that the upstream HTML parser
> > library might want to do. I'm not sure if NekoHTML is maintained
> > anywhere, but if it was we should probably consider sending a patch
> > for that.
>
> Well I used the same piece of code that you used for txtparser so
> detection/decoding is provided by icu4j and not Tika. So I was
> basically just using icu for getting properly decoded reader nothing
> fancier than that.


Agreed, it's a relatively simple and straightforward enhancement, but
wouldn't it be useful also for other users of NekoHTML? Also, how
about handling <meta http-equiv='Content-Type'
content='text/html;charset=...'> tags or <?xml version="1.0"
encoding="..."?> prefixes?
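
To make the question concrete, here is a minimal sketch of what sniffing a declared encoding from the first bytes of a document could look like. It uses only the Java standard library. The class name DeclaredEncodingSniffer, the sniff method and the 1024-byte window are hypothetical choices for illustration, not anything NekoHTML, icu4j or Tika provides. It also assumes an ASCII-compatible encoding, so UTF-16 input would need separate handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: looks for an encoding declared inside the document itself. */
public class DeclaredEncodingSniffer {

    // XML declaration at the very start, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    private static final Pattern XML_DECL = Pattern.compile(
            "^\\s*<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // charset=... anywhere inside a <meta> tag, quoted or unquoted
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: declarations appear within the first 1024 bytes.
    private static final int WINDOW = 1024;

    public static Optional<String> sniff(byte[] prefix) {
        // ISO-8859-1 maps every byte to exactly one char, so this decoding
        // never fails and leaves ASCII markup readable for the regexes.
        int length = Math.min(prefix.length, WINDOW);
        String head = new String(prefix, 0, length, StandardCharsets.ISO_8859_1);

        Matcher xml = XML_DECL.matcher(head);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(head);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        // Nothing declared: the caller would fall back to statistical detection.
        return Optional.empty();
    }
}
```

Even this small sketch leaves open what to do when the declared name and the statistically detected charset disagree, which is the kind of decision that seems to belong in the parser library.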

IMHO concerns like these are a slippery slope, and we should avoid
taking them on within Tika. It's best if all such knowledge is
embedded in the external parser libraries we use.

BR,

Jukka Zitting
