I tested this with Nutch 1.1, using only
"http://cnx.org/lenses/ccotp/endorsements/atom".

In "nutch-site.xml", I added "rss" and "feed" to the property "plugin.includes":

"...parse-(text|html|js|tika|pdf|rss)|feed|..."

(I don't know which of the two made the difference.)
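For orientation, the entry in "nutch-site.xml" would look roughly like the sketch below. The surrounding plugin list is elided here just as above; keep whatever your existing "plugin.includes" value already lists and only add "rss" and "feed". The description text is my own wording, not taken from Nutch.

```xml
<property>
  <name>plugin.includes</name>
  <!-- "..." stands for the rest of your existing plugin list (not shown here) -->
  <value>...parse-(text|html|js|tika|pdf|rss)|feed|...</value>
  <description>Plugins to load; rss and feed added for feed parsing.</description>
</property>
```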

I added this to "parse-plugins.xml":

<mimeType name="application/xml">
  <plugin id="parse-html" />
  <plugin id="parse-rss" />
  <plugin id="feed" />
</mimeType>

and to "regex-urlfilter.txt":

+^http://cnx.org/lenses/ccotp/endorsements/atom
# skip everything else
-.

---------
If you use the runbot script from
"http://wiki.apache.org/nutch/Crawl":

I created a directory "urls" and added a text file containing
"http://cnx.org/lenses/ccotp/endorsements/atom" to it. Then I
configured the runbot script, started it with

"sh runbot"

and got the page indexed.
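The seed setup for the runbot script can be sketched as follows. This assumes you run it from the directory that contains the runbot script; the file name "seed.txt" is only a placeholder of mine, since as far as I know any text file inside "urls" is read.

```shell
# Assumption: current directory is the one holding the runbot script.
mkdir -p urls

# "seed.txt" is a hypothetical name; one URL per line.
printf '%s\n' 'http://cnx.org/lenses/ccotp/endorsements/atom' > urls/seed.txt

# Show what will be injected.
cat urls/seed.txt

# After configuring the script, start the crawl with:
# sh runbot
```

The crawl command is left commented out so the snippet only prepares the seed list and does not start anything by itself.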


On 21.08.2010 01:31, Israel wrote:
Hello, I tried to index pages that use XML, RSS, Atom or even
RDF, but errors occur. I downloaded the "parse-xml" plugin, but
I don't know how to use it.

I want to index these pages:

http://cnx.org/lenses/ccotp/endorsements/atom
http://ocw.nd.edu/courselist/rss
http://openlearn.open.ac.uk/file.php/1/learningspace.xml

Do I need a plugin? I tried rss and feed. And how do I configure the
"crawl-urlfilter" *.txt files and the seed list (web addresses)? If you
have a plugin, could you please send it to my e-mail? Thank you.

I've searched the web for hours and haven't found an answer.
