I use Nutch 1.1 (released 06 June 2010).
I didn't install any additional plugins.
I think the XML plugin from NUTCH-185 is outdated: the issue is marked
"Resolution: Won't Fix" and "Affects Version/s: 0.7.2, 0.8,
0.8.1".
Check your Nutch version (and update if necessary).
In "nutch-site.xml", look at "<name>plugin.includes</name>" and
check whether parse-tika is enabled, i.e. whether the value contains
"...parse-(text|html|js|tika)...".
If "nutch-site.xml" is empty because you don't use it, check for
parse-tika in "nutch-default.xml" instead.
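As a rough sketch, the entry to look for has the following shape. Only the parse-(text|html|js|tika) group is taken from this thread; the other plugin names in the value are an assumption about a typical Nutch 1.1 default and may differ in your installation, so keep whatever your own file lists there.

```xml
<!-- conf/nutch-site.xml (or conf/nutch-default.xml if nutch-site.xml is empty) -->
<property>
  <name>plugin.includes</name>
  <!-- Assumed example value; the part that matters here is "tika" inside parse-(...) -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

If "tika" is missing from the parse group, parse-tika is not loaded, whatever parse-plugins.xml says.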
-------------------------
-------------------------
TESTING and **MY** BEST MATCHES (maybe some other guys out
there have better ones):
-------------------------
-------------------------
I've tested your links for several hours. This is my complete
log, including all the failures.
Forget my last post concerning changes in
"parse-plugins.xml" and "nutch-site.xml"!
Below you'll find three blocks, each delimited by "==>TODOx<==" and
"==>TODOx-end<==". They mark what I recommend for each link.
1)
- http://cnx.org/lenses/ccotp/endorsements/atom:
contentType=application/xml.
Crawled with Nutch 1.1 as is.
NOTHING CHANGED IN *.XML-FILES.
SUCCESS, BUT: some search summaries contain HTML source code
(like "<p>" or "<a>").
I checked the source code of the page: there are lots of entities inside
the SUBTITLE tag, which is declared as type="text/html" => Not a
parser fault, I think!
2)
nutch-site.xml: removed parse-tika:
Error: parser not found for contentType=application/xml
3)
parse-plugins.xml: added:
<mimeType name="application/xml">
<plugin id="parse-html" />
</mimeType>
Same result as with parse-tika. Not better.
4)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="parse-rss" />
</mimeType>
nutch-site.xml: added parse-rss:
No errors but no search results (empty).
5)
parse-plugins.xml: changed
<mimeType name="application/xml">
<plugin id="feed" />
</mimeType>
nutch-site.xml: added feed:
Errors, errors, errors.
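For reference, "added parse-rss" (step 4) and "added feed" (step 5) in nutch-site.xml presumably mean extending plugin.includes roughly as sketched here. Everything outside the parse group and the feed entry is an assumed example value, not something this thread confirms, so adapt it to your own file.

```xml
<!-- conf/nutch-site.xml: sketch only, the surrounding plugin names are assumptions -->
<property>
  <name>plugin.includes</name>
  <!-- Step 4: "rss" added to the parse-(...) group enables parse-rss.
       Step 5: "feed" added as its own entry enables the feed plugin. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika|rss)|feed|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

A plugin listed in parse-plugins.xml for a mime type can only be used if it is also enabled here.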
======> *MY* BEST MATCH:
Page - http://cnx.org/lenses/ccotp/endorsements:
contentType=application/xml.
==>TODO1<==
Nothing to change.
Crawl as is. parse-tika handles it.
==>TODO1-end<==
---------------------------------------------------------
1)
- http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml
Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type
application/rss+xml.
That surprises me, because I found "application/rss+xml" in
tika-mimetypes.xml.
2)
Found in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
nutch-site.xml: added parse-rss but not feed:
Error: Can't be handled as rss document.
org.apache.commons.feedparser.FeedParserException:
org.jdom.input.JDOMParseException: Error on line 768: The
element type "dc:creator" must be terminated by the matching
end-tag "</dc:creator>".
nutch-site.xml: removed parse-rss and added feed:
Error: ditto (same as above).
This means: parse-rss or feed would be the right parser if the
page weren't corrupt. Is it? I checked the source code
with Firefox and couldn't find any error!
Line 768: <dc:creator>The Open University</dc:creator>
looks fine. Strange!
3)
Changed in parse-plugins.xml:
<mimeType name="application/rss+xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
SUCCESS!
======> *MY* BEST MATCH:
Page - http://openlearn.open.ac.uk/file.php/1/learningspace.xml:
mime-type application/rss+xml.
==>TODO2<==
In conf/parse-plugins.xml:
--FIND:
<mimeType name="application/rss+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
--REPLACE WITH:
<mimeType name="application/rss+xml">
<plugin id="parse-html" /><!--subsequently added. parse-rss
and feed throw unreproducible error. thread msg00666.html et
seqq.-->
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
==>TODO2-end<==
---------------------------------------------------------
1)
- http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml
Crawl as is. parse-tika.
Error: Can't retrieve Tika parser for mime-type
application/rdf+xml.
That surprises me, because I found "application/rdf+xml" in
tika-mimetypes.xml.
2)
nutch-site.xml: added parse-rss but not feed.
parse-plugins.xml: added:
<mimeType name="application/rdf+xml">
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): No search result!
3)
parse-plugins.xml: changed to:
<mimeType name="application/rdf+xml">
<plugin id="parse-html" />
</mimeType>
Searched for normal text: SUCCESS!
Searched for a title (displayed as anchors): SUCCESS!
======> *MY* BEST MATCH:
Page - http://ocw.nd.edu/courselist/rss:
mime-type application/rdf+xml.
==>TODO3<==
In conf/parse-plugins.xml:
--FIND:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
--REPLACE WITH:
<mimeType name="text/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
<mimeType name="application/rdf+xml"><!--subsequently added.
parse-tika throws error. thread msg00666.html et seqq.-->
<plugin id="parse-html" />
</mimeType>
==>TODO3-end<==
---------------------------------------------------------
On 21.08.2010 20:43, Israel wrote:
2010/8/21 Israel<[email protected]>
Thanks for your help, please help me with this.
Hello, I downloaded the parse plugin from
"https://issues.apache.org/jira/browse/NUTCH-185", and I don't know where to
put this:
Added to "parse-plugins.xml"
<mimeType name="application/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>