Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki
Chris Mattmann wrote: Hi, I'm not an XML expert by any means, but wouldn't it be simpler to just wrap any text where illegal chars are possible with a <![CDATA[ ]]> tag? That way, the offending characters won't be dropped and the process won't be lossy, no? If the CDATA method won't work,

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Andrzej Bialecki
Dawid Weiss wrote: We should not drop the offending characters, but escape them. Either the Unicode entity (&#nn;) or CDATA way is ok (and CDATA way is simpler). This isn't entirely true, Andrzej -- escaping a character, or putting it in a CDATA section is just about different ways of

Re: Enter Chinese in search box, returns messy results

2005-10-13 Thread Andrzej Bialecki
Song Han wrote: I copy and paste a Chinese keyword into the search box, and the returned results look messy and are not readable. Does anyone know how to support keyword input and result output in Chinese? If you are using tomcat, then you are probably missing useBodyEncodingForURI=true in your

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss
Right, I didn't think about this... somehow I thought this was all about special characters like ' . Oh, believe me: this knowledge came from sour experience not from book wisdom... I know for sure some XML parsers complain about invalid characters, while others don't. Then we should

NutchAnalysis -- Distinguishing between quoted clauses (phrases) and unquoted clauses (individual terms) after parsing

2005-10-13 Thread Dalton, Jeffery
Looking back on this post, perhaps it might be better suited to the developer list... Thoughts anyone? - Jeff From: Dalton, Jeffery Sent: Wednesday, October 12, 2005 4:01 PM To: nutch-user@lucene.apache.org Subject: NutchAnalysis -- Distinguishing between

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread stack
Andrzej Bialecki wrote: Then we should take the best of both worlds - escape valid characters, and replace invalid ones with '?' or space, or nothing. I know a place where we could find some inspiration (Carrot2 XMLSerializerHelper.java ... ;-) ) Thanks for the pointer. See starting
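
The "best of both worlds" idea quoted above (escape the characters XML allows, replace the ones it forbids) can be sketched roughly as follows. This is only an illustration, not the NUTCH-110 patch and not the Carrot2 code: the class name XmlTextSanitizer and the method sanitize are hypothetical, and the valid range is taken from the XML 1.0 Char production.

```java
// Hypothetical helper, written for illustration only; it is neither the
// NUTCH-110 patch nor Carrot2's XMLSerializerHelper.
final class XmlTextSanitizer {
    private XmlTextSanitizer() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Escapes markup characters and substitutes 'replacement' for every
     * code point that XML 1.0 forbids. Pass "" to drop invalid characters,
     * or "?" / " " to leave a visible placeholder.
     */
    static String sanitize(String input, String replacement) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (!isValidXmlChar(cp)) {
                // Control characters and unpaired surrogates end up here.
                out.append(replacement);
            } else if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

Note that wrapping text in CDATA or writing a numeric character reference does not make a forbidden code point legal in XML 1.0, which is why the sketch replaces such characters instead of escaping them.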

Re: [jira] Updated: (NUTCH-110) OpenSearchServlet outputs illegal xml characters

2005-10-13 Thread Dawid Weiss
The differences between this method and the patch supplied in NUTCH-110 are: Take a closer look at the source code -- 1. XMLSerializerHelper#toValidXmlText throws an exception when an invalid character is found whereas NUTCH-110 just drops it. Not really, it is governed by a boolean flag. If

Re: nutch downloads

2005-10-13 Thread Doug Cutting
Okay. All nutch downloads should now be through mirrors. The web site now refers to downloads through the url: http://www.apache.org/dyn/closer.cgi/lucene/nutch/ The former download urls now redirect to the appropriate places: http://lucene.apache.org/lucene/nutch/release/

Re: nutch downloads

2005-10-13 Thread Joshua Slive
Doug Cutting wrote: Okay. All nutch downloads should now be through mirrors. The web site now refers to downloads through the url: http://www.apache.org/dyn/closer.cgi/lucene/nutch/ The former download urls now redirect to the appropriate places:

patch for changes related to TestNDFS

2005-10-13 Thread Paul Baclace
This patch is for comments/local name change or error msg change only, to clarify the code as it relates to the new JUnit test, TestNDFS that I wrote. In many cases the intention was to help the next person reading the source. These changes should be very safe and are unlikely to introduce