I use Nutch version 1.1 (Released 06 June 2010).

I didn't install any additional plugins!

I think your xml-plugin at NUTCH-185 is outdated: "Resolution:Won't Fix" and "Affects Version/s: 0.7.2, 0.8, 0.8.1".

Check your Nutch version (and update if necessary).

Check in "nutch-site.xml" whether the value of "<name>plugin.includes</name>" includes parse-tika, e.g.
"...parse-(text|html|js|tika)...".

If "nutch-site.xml" is empty (because you don't use it?), check for parse-tika in "nutch-default.xml" instead.
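
The check described above can be sketched in a few lines of Python. This is only a sketch under assumptions: the sample file content is made up and shortened, the helper names are hypothetical, and it assumes that the plugin.includes regular expression has to match the complete plugin id.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, shortened sample in the <configuration> format used by
# nutch-site.xml and nutch-default.xml. A real plugin.includes is longer.
SAMPLE_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)</value>
  </property>
</configuration>
"""


def plugin_includes(config_xml):
    """Return the plugin.includes value, or None if the property is not set."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_included(includes_regex, plugin_id):
    """Assumption: the regex has to match the whole plugin id."""
    return re.fullmatch(includes_regex, plugin_id) is not None


value = plugin_includes(SAMPLE_SITE_XML)
for plugin_id in ("parse-tika", "parse-rss", "feed"):
    print(plugin_id, is_included(value, plugin_id))
```

If plugin_includes() returns None for your "nutch-site.xml", run the same check against "nutch-default.xml". "Added parse-rss" in the tests below means extending this value, for example to parse-(text|html|js|tika|rss).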

-------------------------
-------------------------
TESTING and **MY** BEST MATCHES (maybe some other guys out there have better ones):
-------------------------
-------------------------

I've tested your links for several hours. This is my complete journal, including all failures.

Forget my last post concerning changes in "parse-plugins.xml" and "nutch-site.xml"!

You'll find three blocks delimited by "==>TODOx<==" and "==>TODOx-end<==".

1)
- http://cnx.org/lenses/ccotp/endorsements/atom:
contentType=application/xml.

Crawled with Nutch 1.1 as is.
NOTHING CHANGED IN *.XML-FILES.
SUCCESS, BUT: Some search summaries contain HTML source code (like "<p>" or "<a>"). I checked the source code of the page: there are lots of entities inside the SUBTITLE tag, which is declared as type="text/html" => Not a parser fault, I think!

2)
nutch-site.xml: removed parse-tika:
Error: parser not found for contentType=application/xml

3)
parse-plugins.xml: added:
<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>
Same result as with parse-tika. Not better.

4)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="parse-rss" />
</mimeType>
nutch-site.xml: added parse-rss:
No errors but no search results (empty).

5)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="feed" />
</mimeType>
nutch-site.xml: added feed:
Errors, errors, errors.

======> *MY* BEST MATCH:
Page - http://cnx.org/lenses/ccotp/endorsements:
contentType=application/xml.

==>TODO1<==
Nothing.
Crawl as is. parse-tika did it.
==>TODO1-end<==
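
Why "<p>" and "<a>" can end up in the summaries of test 1) without the parser doing anything wrong: a minimal Python sketch with a made-up Atom fragment (the real feed is not reproduced here). The HTML sits in the subtitle as escaped entities, and any XML parser turns those back into literal markup characters when it extracts the text.

```python
import html
import re
import xml.etree.ElementTree as ET

# Made-up Atom fragment for illustration only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example lens</title>
  <subtitle type="text/html">&lt;p&gt;See the &lt;a href="http://example.org/"&gt;course list&lt;/a&gt;.&lt;/p&gt;</subtitle>
</feed>
"""

ATOM = "{http://www.w3.org/2005/Atom}"

root = ET.fromstring(SAMPLE_FEED)
subtitle = root.find(ATOM + "subtitle").text

# The XML parser resolves the entities once, so the extracted text
# contains literal "<p>" and "<a ...>".
print(subtitle)

# Crude tag removal for illustration; this is not a real HTML parser.
plain = html.unescape(re.sub(r"<[^>]+>", "", subtitle))
print(plain)
```

Getting clean summaries would therefore need a second, HTML-aware pass over such fields. Whether any Nutch plugin does that for Atom subtitles is not something this sketch can tell you.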
---------------------------------------------------------

1)
- http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml
Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type application/rss+xml. That surprises me, because I found "application/rss+xml" in tika-mimetypes.xml.

2)
Found in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
nutch-site.xml: added parse-rss but not feed:
Error: Can't be handled as rss document. org.apache.commons.feedparser.FeedParserException: org.jdom.input.JDOMParseException: Error on line 768: The element type "dc:creator" must be terminated by the matching end-tag "</dc:creator>".

nutch-site.xml: removed parse-rss and added feed:
Error: ditto (same as above)

This means: parse-rss or feed would be the right parsers if the page weren't corrupt! Is it? I checked the source code with Firefox and couldn't find any error!
Line 768: <dc:creator>The Open University</dc:creator>
looks fine. Strange!

3)
Changed in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
SUCCESS!
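
A browser's source view is a weak check for the "line 768" error from test 2). A strict XML parser run over the raw bytes reports the first well-formedness error with its position. The sketch below uses Python's standard library; the broken feed in it is made up (an unescaped "&"), and it is only a guess at one kind of fault that is easy to overlook by eye, not a diagnosis of the real page.

```python
import xml.etree.ElementTree as ET


def first_xml_error(data):
    """Return (line, column, message) of the first well-formedness error, or None."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Made-up feed with an unescaped "&" inside dc:creator.
BROKEN = b"""<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <dc:creator>Arts & Humanities</dc:creator>
    </item>
  </channel>
</rss>
"""

print(first_xml_error(BROKEN))
print(first_xml_error(BROKEN.replace(b"&", b"&amp;")))
```

To test the real feed, save it to disk and pass the file's bytes (opened with mode "rb") to first_xml_error(). If that returns None, the copy you saved is well-formed and the fault lies elsewhere, for example in what the crawler actually fetched.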

======> *MY* BEST MATCH:
Page - http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml.

==>TODO2<==
In conf/parse-plugins.xml:

--FIND:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

--REPLACE WITH:
<mimeType name="application/rss+xml">
<plugin id="parse-html" /><!--subsequently added. parse-rss and feed throw unreproducible error. thread msg00666.html et seqq.-->
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
==>TODO2-end<==
---------------------------------------------------------

1)
- http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml

Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type application/rdf+xml. That surprises me, because I found "application/rdf+xml" in tika-mimetypes.xml.

2)
nutch-site.xml: added parse-rss but not feed.
parse-plugins.xml: added:
<mimeType name="application/rdf+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): No search result!

3)
<mimeType name="application/rdf+xml">
<plugin id="parse-html" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): SUCCESS!

======> *MY* BEST MATCH:
Page - http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml.

==>TODO3<==
In conf/parse-plugins.xml:

--FIND:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>

--REPLACE WITH:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/rdf+xml"><!--subsequently added. parse-tika throws error. thread msg00666.html et seqq.-->
<plugin id="parse-html" />
</mimeType>
==>TODO3-end<==
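
If you prefer to script the TODO2 and TODO3 changes instead of editing by hand, here is a minimal Python sketch. Assumptions: the sample is a made-up, shortened stand-in for conf/parse-plugins.xml, set_plugins() is a hypothetical helper, and new mimeType entries are assumed to belong before the aliases element. The test results above suggest the plugins are tried in the order listed, so the helper keeps the order you pass in.

```python
import xml.etree.ElementTree as ET

# Made-up, shortened stand-in for conf/parse-plugins.xml.
SAMPLE = """<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="parse-rss" />
    <plugin id="feed" />
  </mimeType>
  <!-- alias entries omitted in this sample -->
  <aliases />
</parse-plugins>
"""


def set_plugins(root, mime_type, plugin_ids):
    """Make <mimeType name=mime_type> list exactly plugin_ids, in that order."""
    entry = None
    for candidate in root.findall("mimeType"):
        if candidate.get("name") == mime_type:
            entry = candidate
            break
    if entry is None:
        entry = ET.Element("mimeType", {"name": mime_type})
        # Assumption: mimeType entries belong before <aliases>.
        aliases = root.find("aliases")
        index = list(root).index(aliases) if aliases is not None else len(root)
        root.insert(index, entry)
    for child in list(entry):
        entry.remove(child)
    for plugin_id in plugin_ids:
        ET.SubElement(entry, "plugin", {"id": plugin_id})
    return entry


root = ET.fromstring(SAMPLE)
set_plugins(root, "application/rss+xml", ["parse-html", "parse-rss", "feed"])  # TODO2
set_plugins(root, "application/rdf+xml", ["parse-html"])                       # TODO3
print(ET.tostring(root, encoding="unicode"))
```

Note that ElementTree drops XML comments when parsing, so writing the result back over the real file would lose its comments. For two small changes, editing by hand as described in the TODO blocks is probably the safer route.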
---------------------------------------------------------




On 21.08.2010 20:43, Israel wrote:

2010/8/21 Israel<[email protected]>




Thanks for your help, please help me with this.

Hello, I downloaded the parse plugin from
"https://issues.apache.org/jira/browse/NUTCH-185" and I don't know where
to put this:


Added to "parse-plugins.xml"

<mimeType name="application/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>





