[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v5.patch
No, the double call to getLegalXml is not intentional. It's a mistake. Thanks
for finding it.
I've attached
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v4.patch
v3 mistakenly included debugging code.
Attached cleaned up v4.
OpenSearchServlet outputs illegal xml
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars08-v3.patch
Version of the patch that doesn't ...process the String twice if it contains
some illegal characters! Its name
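
For illustration, here is a minimal sketch of how a filter can avoid processing the String twice: it scans once and only allocates a buffer when the first illegal character turns up. This is not the code from the attached patch; the class name LazyXmlCleaner and the method names are made up for this example. The legal ranges are those of the XML 1.0 Char production.

```java
final class LazyXmlCleaner {
    private LazyXmlCleaner() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops illegal characters in one pass; returns the input itself if it is already clean. */
    static String clean(String text) {
        StringBuilder out = null; // allocated only when the first illegal character is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (isLegalXmlChar(cp)) {
                if (out != null) {
                    out.appendCodePoint(cp);
                }
            } else if (out == null) {
                out = new StringBuilder(text.length());
                out.append(text, 0, i); // copy the clean prefix once
            }
            i += len;
        }
        return out == null ? text : out.toString();
    }
}
```

In the common case where the text is already clean, nothing is copied and the original String is returned.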
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Version: 0.8-dev
(was: 0.7)
Was version 0.7. Changed 'Affects Version' to 0.8-dev.
OpenSearchServlet outputs illegal xml characters
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
Stefan Neufeind updated NUTCH-110:
Attachment: fixIllegalXmlChars08.patch
Since the original patch didn't apply cleanly for me on 0.8-dev
(nightly-2006-05-20), I redid it for 0.8 ...
With this
: (NUTCH-110) OpenSearchServlet outputs illegal
xml characters
...
So, will I amend the patch in NUTCH-110 so it uses
XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
Copy the method's contents. It doesn't really make sense to copy the
entire class just for this method. Good luck
Dawid Weiss wrote:
...
So, will I amend the patch in NUTCH-110 so it uses
XMLSerializerHelper#toValidXmlText in place of #getLegalXml method?
Copy the method's contents. It doesn't really make sense to copy the
entire class just for this method. Good luck.
Thanks Dawid.
I've just
Chris Mattmann wrote:
Hi,
I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible in a <![CDATA[ ]]> section? That
way, the offending characters won't be dropped and the process won't be
lossy, no?
If the CDATA method won't work,
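
To make the CDATA suggestion concrete, here is a small sketch (class and method names are hypothetical, not from Nutch). It wraps text in a CDATA section and then checks the result with the JDK's built-in DOM parser. Note that the XML 1.0 specification restricts CDATA content to the same legal character set as ordinary text, so markup characters such as < and & are protected, but a control character like U+0001 is still rejected.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

final class CdataSketch {
    private CdataSketch() {}

    /** Wraps text in a CDATA section, splitting any literal "]]>" across two sections. */
    static String wrapInCdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Returns true if the JDK's default DOM parser accepts the document. */
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false; // not well-formed (or the parser could not be created)
        }
    }
}
```

For example, parses("<r>" + wrapInCdata("a < b & c") + "</r>") succeeds, while the same call with "a\u0001b" as the text fails, so CDATA alone does not solve the illegal-character problem.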
Dawid Weiss wrote:
We should not drop the offending characters, but escape them. Either
the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
This isn't entirely true, Andrzej -- escaping a character, or putting it
in a CDATA section is just about different ways of
Right, I didn't think about this... somehow I thought this was all about
special characters like ' .
Oh, believe me: this knowledge came from sour experience not from book
wisdom... I know for sure some XML parsers complain about invalid
characters, while others don't.
Then we should
Andrzej Bialecki wrote:
Then we should take the best of both worlds - escape valid characters,
and replace invalid ones with '?' or space, or nothing. I know a place
where we could find some inspiration (Carrot2 XMLSerializerHelper.java
... ;-) )
Thanks for the pointer. See starting
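
As a rough sketch of the "best of both worlds" idea quoted above: escape the characters that are legal but have special meaning in markup, and substitute a replacement for anything outside the XML 1.0 Char production. This is not the Carrot2 code; XmlTextEscaper and its method names are invented for this example.

```java
final class XmlTextEscaper {
    private XmlTextEscaper() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes markup characters and substitutes 'replacement' for illegal ones. */
    static String escape(String text, char replacement) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!isLegal(cp)) {
                out.append(replacement); // lossy, but keeps the output well-formed
                continue;
            }
            switch (cp) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Passing '?' or ' ' as the replacement gives the two variants mentioned in the message; dropping the character entirely would need a small change to skip the append.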
The differences between this method and the patch supplied in NUTCH-110
are:
Take a closer look at the source code --
1. XMLSerializerHelper#toValidXmlText throws an exception when it encounters
an invalid character, whereas NUTCH-110 just drops it.
Not really, it is governed by a boolean flag. If
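
A sketch of what flag-governed behavior could look like, for readers following the comparison: one boolean decides whether an invalid character raises an exception or is silently dropped. The actual signature of XMLSerializerHelper#toValidXmlText is not shown in this thread, so the class name, method name, parameter, and exception type below are assumptions made for illustration only.

```java
final class StrictOrLenientFilter {
    private StrictOrLenientFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Hypothetical method: if failOnInvalid is true, the first illegal character
     * raises an exception; otherwise illegal characters are dropped.
     */
    static String toValidText(String text, boolean failOnInvalid) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isLegal(cp)) {
                out.appendCodePoint(cp);
            } else if (failOnInvalid) {
                throw new IllegalArgumentException(
                    String.format("Illegal XML character U+%04X at index %d", cp, i));
            }
            // lenient mode: fall through and drop the character
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```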
[ http://issues.apache.org/jira/browse/NUTCH-110?page=all ]
[EMAIL PROTECTED] updated NUTCH-110:
Attachment: fixIllegalXmlChars.patch
Attached patch runs all xml text through a check for bad xml characters. This
patch is brutal dropping silently
-Original Message-
From: [EMAIL PROTECTED] (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 12, 2005 5:19 PM
To: nutch-dev@incubator.apache.org
Subject: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml
characters