Chris Mattmann wrote:
Hi,
I'm not an XML expert by any means, but wouldn't it be simpler to just wrap
any text where illegal chars are possible with a <![CDATA[ ]]> tag? That
way, the offending characters won't be dropped and the process won't be
lossy, no?
If the CDATA method won't work,
Dawid Weiss wrote:
We should not drop the offending characters, but escape them. Either
the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler).
This isn't entirely true, Andrzej -- escaping a character, or putting it
in a CDATA section is just about different ways of
Song Han wrote:
I copy and paste Chinese keyword into the search box, and the returned
results look messy and are not readable.
Does anyone know how to support keyword input and result output in Chinese?
If you are using tomcat, then you are probably missing
useBodyEncodingForURI=true in your
Right, I didn't think about this... somehow I thought this was all about
special characters like ' .
Oh, believe me: this knowledge came from sour experience not from book
wisdom... I know for sure some XML parsers complain about invalid
characters, while others don't.
Then we should
Looking back on this post, perhaps it might be better suited to the
developer list...
Thoughts anyone?
- Jeff
From: Dalton, Jeffery
Sent: Wednesday, October 12, 2005 4:01 PM
To: nutch-user@lucene.apache.org
Subject: NutchAnalysis -- Distinguishing between
Andrzej Bialecki wrote:
Then we should take the best of both worlds - escape valid characters,
and replace invalid ones with '?' or space, or nothing. I know a place
where we could find some inspiration (Carrot2 XMLSerializerHelper.java
... ;-) )
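
For illustration only, here is a minimal sketch of the "escape valid
characters, replace invalid ones" idea described above. It is neither the
Carrot2 XMLSerializerHelper code nor the NUTCH-110 patch; the class name
XmlTextSanitizer and the method names are hypothetical, and the choice of
'?' as the replacement is just one of the options mentioned. The validity
check assumes the XML 1.0 Char production.

```java
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Escapes markup characters and replaces invalid characters with '?'. */
    static String escapeAndReplace(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            // Work on code points so surrogate pairs are kept together;
            // an unpaired surrogate is reported as itself and rejected below.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) {
                out.append('?'); // assumption: replace rather than drop or throw
                continue;
            }
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: a &lt; b &amp; c?
        System.out.println(escapeAndReplace("a < b & c\u0001"));
    }
}
```

Valid non-ASCII text, such as Chinese keywords, passes through unchanged;
only the five markup characters are escaped and only characters outside the
XML 1.0 range are replaced.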
Thanks for the pointer. See starting
The differences between this method and the patch supplied in NUTCH-110
are:
Take a closer look at the source code --
1. XMLSerializerHelper#toValidXmlText throws an exception when an
invalid character is encountered, whereas NUTCH-110 just drops it.
Not really, it is governed by a boolean flag. If
Okay. All nutch downloads should now be through mirrors.
The web site now refers to downloads through the url:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/
The former download urls now redirect to the appropriate places:
http://lucene.apache.org/lucene/nutch/release/
Doug Cutting wrote:
Okay. All nutch downloads should now be through mirrors.
The web site now refers to downloads through the url:
http://www.apache.org/dyn/closer.cgi/lucene/nutch/
The former download urls now redirect to the appropriate places:
This patch is for comments/local name change or error
msg change only, to clarify the code as it relates to
the new JUnit test, TestNDFS that I wrote. In many
cases the intention was to help the next person reading
the source. These changes should be very safe and
are unlikely to introduce