Hello,

I am using the OpenSearch API (RSS) to access Nutch, and sometimes I
receive content that renders HTML controls such as "input".

Attached is a sample of the XML generated by the OpenSearch API. The
sample includes an item that renders an input control when viewed in
Firefox. The same happens when the search results are passed on to
other systems and displayed as HTML.

Below is a fragment from the attached file with the problematic
content. I think the parsing of the document snippet should be
improved in Nutch to fix this.

<item>
<title>var thisDomain = document.domain;</title>
<description>&lt;span class="ellipsis"&gt; ...
&lt;/span&gt;searchForm' action='http://search.money.&lt;span
class="highlight"&gt;cnn&lt;/span&gt;.com/pages/search.jsp'&gt;&lt;input&lt;span
class="ellipsis"&gt; ... &lt;/span&gt;id='search_button'
SRC='http://i.l.&lt;span
class="highlight"&gt;cnn&lt;/span&gt;.net/money/.element/img/1&lt;span
class="ellipsis"&gt; ... &lt;/span&gt;</description>

<link>http://i.cnn.net/money/.element/ssi/javascript/1.1/cnnhat_section.js</link>
<nutch:site>i.cnn.net</nutch:site>
<nutch:cache>http://192.168.2.204:8080/nutch/cached.jsp?idx=0&amp;id=135397</nutch:cache>
<nutch:explain>http://192.168.2.204:8080/nutch/explain.jsp?idx=0&amp;id=135397&amp;query=cnn&amp;lang=en</nutch:explain>
<nutch:lastModified>1239773334000</nutch:lastModified>
<nutch:segment>20110103001242</nutch:segment>
<nutch:digest>773bc801c969e25fe331d7d36feaa05b</nutch:digest>

<nutch:tstamp>20101212072027490</nutch:tstamp>
<nutch:boost>2.677803</nutch:boost>
<nutch:contentLength>6106</nutch:contentLength>
</item>
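
To illustrate the kind of fix I have in mind, here is a minimal sketch,
not actual Nutch code. The class and method names (SnippetEscaper,
Fragment, Kind, escape, toHtml) are hypothetical and made up for this
example. The idea is that text taken from the page is HTML-escaped
before the highlight and ellipsis spans are added, so only those spans
act as markup when the description is displayed.

```java
import java.util.List;

// Hypothetical helper, not part of Nutch: escapes page text in a snippet
// before the highlight and ellipsis markup is added around it.
final class SnippetEscaper {

    // One piece of a snippet: plain page text, a highlighted term, or an ellipsis.
    enum Kind { TEXT, HIGHLIGHT, ELLIPSIS }

    static final class Fragment {
        final Kind kind;
        final String text;

        Fragment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Replace the characters that are significant in HTML with entities.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build the snippet HTML: page text is escaped, only our own spans are markup.
    static String toHtml(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            switch (f.kind) {
                case HIGHLIGHT:
                    sb.append("<span class=\"highlight\">")
                      .append(escape(f.text))
                      .append("</span>");
                    break;
                case ELLIPSIS:
                    sb.append("<span class=\"ellipsis\"> ... </span>");
                    break;
                default:
                    sb.append(escape(f.text));
            }
        }
        return sb.toString();
    }
}
```

With something like this, a page fragment such as "<input" would reach
the HTML as "&lt;input" and be shown as literal text rather than
rendered as a control. The RSS writer would still XML-escape the whole
description as it does now.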

Thanks,
Yavinty

Attachment: opensearch.xhtml
Description: application/xhtml
