Hello,

I am using the OpenSearch API (RSS) to access Nutch, and sometimes I receive content that renders HTML controls such as "input".
Attached is a sample of the XML generated by the OpenSearch API. The sample includes an item that renders an input control when viewed in Firefox. The same happens when the search results are passed on to other systems and displayed as HTML.

Below is a fragment from the attached file with the problematic content. I think the parsing of the document snippet should be improved in Nutch to fix this.

<item>
  <title>var thisDomain = document.domain;</title>
  <description><span class="ellipsis"> ... </span>searchForm' action='http://search.money.<span class="highlight">cnn</span>.com/pages/search.jsp'><input<span class="ellipsis"> ... </span>id='search_button' SRC='http://i.l.<span class="highlight">cnn</span>.net/money/.element/img/1<span class="ellipsis"> ... </span></description>
  <link>http://i.cnn.net/money/.element/ssi/javascript/1.1/cnnhat_section.js</link>
  <nutch:site>i.cnn.net</nutch:site>
  <nutch:cache>http://192.168.2.204:8080/nutch/cached.jsp?idx=0&id=135397</nutch:cache>
  <nutch:explain>http://192.168.2.204:8080/nutch/explain.jsp?idx=0&id=135397&query=cnn&lang=en</nutch:explain>
  <nutch:lastModified>1239773334000</nutch:lastModified>
  <nutch:segment>20110103001242</nutch:segment>
  <nutch:digest>773bc801c969e25fe331d7d36feaa05b</nutch:digest>
  <nutch:tstamp>20101212072027490</nutch:tstamp>
  <nutch:boost>2.677803</nutch:boost>
  <nutch:contentLength>6106</nutch:contentLength>
</item>

Thanks,
Yavinty
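
P.S. To illustrate the kind of fix I have in mind, here is a minimal sketch in Java. It is not Nutch's actual code: the class SnippetEscaper, the Fragment type and the Kind enum are hypothetical names made up for this example. The idea is only that the raw page text in each snippet fragment gets HTML-escaped before the highlight and ellipsis spans are wrapped around it, so a stray "<input" from the page is shown as text instead of being rendered as a control.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: escapes raw snippet text
// before the highlight/ellipsis markup is added around it.
final class SnippetEscaper {

    /** Hypothetical fragment kinds, named here only for illustration. */
    enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    /** One piece of a snippet: its kind plus the raw text taken from the page. */
    static final class Fragment {
        final Kind kind;
        final String text;

        Fragment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Replaces the characters that are significant in HTML with entities. */
    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds the snippet HTML; only the wrapper spans are emitted as markup. */
    static String toHtml(List<Fragment> fragments) {
        StringBuilder html = new StringBuilder();
        for (Fragment f : fragments) {
            switch (f.kind) {
                case ELLIPSIS:
                    html.append("<span class=\"ellipsis\"> ... </span>");
                    break;
                case HIGHLIGHT:
                    html.append("<span class=\"highlight\">")
                        .append(escape(f.text))
                        .append("</span>");
                    break;
                default:
                    html.append(escape(f.text));
            }
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Fragments modelled on the problematic item from the attached sample.
        List<Fragment> snippet = Arrays.asList(
            new Fragment(Kind.ELLIPSIS, ""),
            new Fragment(Kind.TEXT, "searchForm' action='http://search.money."),
            new Fragment(Kind.HIGHLIGHT, "cnn"),
            new Fragment(Kind.TEXT, ".com/pages/search.jsp'><input"),
            new Fragment(Kind.ELLIPSIS, ""));
        System.out.println(toHtml(snippet));
    }
}
```

With something like this in place, the page text in the description would arrive already escaped, and the only real markup left in the snippet would be the highlight and ellipsis spans. Whether the escaping belongs in the summarizer or in the OpenSearch servlet is a separate question that I have not looked into.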
opensearch.xhtml
Description: application/xhtml

