Hi folks. What's the motivation for HTMLStripReader padding its output with white space after each entity reference, using exactly as many spaces as the entity occupied in the input? It basically looks like this:

"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"

This happens with both numeric and named entities. Needless to say, these added spaces in the character stream do no good, as they effectively split the single term "lód" into two meaningless terms, "ló" and "d".
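
To make the effect concrete, here is a minimal standalone sketch. It is not the HTMLStripReader code: the class, the one-entry entity table and the method names are all made up for illustration, and it assumes a plain whitespace tokenizer downstream. With padding switched on, the decoded string keeps the input's length but tokenizes into two terms; with padding off, it stays a single term.

```java
import java.util.Arrays;
import java.util.Map;

// Standalone illustration only; this is NOT the HTMLStripReader source.
class EntityPaddingDemo {
    // Tiny hypothetical entity table, just enough for the example.
    static final Map<String, Character> ENTITIES =
            Map.of("oacute", Character.valueOf('\u00f3'));

    // Replaces each known named entity with its character. When pad is true,
    // spaces are appended so the output has the same length as the input.
    static String decode(String in, boolean pad) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            int semi = in.charAt(i) == '&' ? in.indexOf(';', i) : -1;
            Character ch = semi > 0 ? ENTITIES.get(in.substring(i + 1, semi)) : null;
            if (ch == null) {
                // Not a known entity: copy the character through unchanged.
                out.append(in.charAt(i++));
                continue;
            }
            out.append(ch);
            if (pad) {
                // The entity spans (semi - i + 1) chars; one of them is now
                // the decoded character, the rest become spaces.
                for (int k = 0; k < semi - i; k++) {
                    out.append(' ');
                }
            }
            i = semi + 1;
        }
        return out.toString();
    }

    // Stand-in for a whitespace tokenizer.
    static String[] tokens(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokens(decode("l&oacute;d", true))));
        System.out.println(Arrays.toString(tokens(decode("l&oacute;d", false))));
    }
}
```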

I can fix this in the code easily, but it looks like it was intentional, so before I write test cases and file a JIRA issue I would like to understand what the original reasons might have been (I really don't see anything this would be useful for). Apologies if I'm being dim here.

Dawid
