I tested this with Nutch 1.1, using only
"http://cnx.org/lenses/ccotp/endorsements/atom".
In "nutch-site.xml" I extended the property "plugin.includes" to
"...parse-(text|html|js|tika|pdf|rss)|feed|..."
(note the added "rss" and "feed"; I don't know which of the two made the difference).
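To see which plugin ids the visible part of that expression would switch on, here is a small Python sketch. It assumes that Nutch activates a plugin when its whole id matches the "plugin.includes" regular expression; only the fragment shown above is used (the rest of the value is elided), and "parse-xml" is just a hypothetical id for comparison.

```python
import re

# Only the fragment visible in the post; the rest of the property value is elided there.
PLUGIN_INCLUDES_FRAGMENT = r"parse-(text|html|js|tika|pdf|rss)|feed"


def is_enabled(plugin_id, includes=PLUGIN_INCLUDES_FRAGMENT):
    """Assumption: a plugin is active when its whole id matches the regex."""
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # "parse-xml" is a hypothetical id, used only to show a non-match.
    for plugin_id in ("parse-rss", "feed", "parse-xml"):
        print(plugin_id, is_enabled(plugin_id))
```

Under that assumption both "parse-rss" and "feed" are switched on by the change, so either of them could be the one that handled the Atom feed.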
I added the following to "parse-plugins.xml":
<mimeType name="application/xml">
<plugin id="parse-html" />
<plugin id="parse-rss" />
<plugin id="feed" />
</mimeType>
and to "regex-urlfilter.txt"
"
+^http://cnx.org/lenses/ccotp/endorsements/atom
# skip everything else
-.
"
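To check what these two rules let through, here is a rough Python sketch of how I understand the filter: the rules are tried top to bottom, the first pattern found in the URL decides, "+" accepts, "-" rejects, and a URL that matches nothing is rejected. The helper names are my own, and Python regex syntax is assumed to be close enough to Java's for these simple patterns.

```python
import re

# The rules from the post, as they stand in "regex-urlfilter.txt".
RULES = """\
+^http://cnx.org/lenses/ccotp/endorsements/atom
# skip everything else
-.
"""


def parse_rules(text):
    """Turn the rule lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(url, rules):
    """First rule whose pattern is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    print(accepts("http://cnx.org/lenses/ccotp/endorsements/atom", rules))
    print(accepts("http://ocw.nd.edu/courselist/rss", rules))
```

With only these two rules, the other feeds from the question would be dropped; each of them would need its own "+" line above the "-." line.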
---------
If you use the runbot script from
"http://wiki.apache.org/nutch/Crawl":
I created a directory "urls" and added a text file containing
"http://cnx.org/lenses/ccotp/endorsements/atom" to it.
Then I configured the runbot script, started it with
"sh runbot"
and the page was indexed.
On 21.08.2010 01:31, Israel wrote:
Hello, I tried to index pages that use XML, RSS, Atom or even
RDF, but errors occur. I downloaded the "parse
xml" plugin, but I don't know how to use it.
I am trying to index these pages:
http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http://openlearn.open.ac.uk/file.php/1/learningspace.xml
Do I need a plugin? I tried with rss and feed. And how do I configure the
"crawl-urlfilter.txt" file and the seed file (web addresses)? If you have a
plugin, could you please send it to my e-mail? Thank you.
I have searched the web for hours and have not found an answer.