1. What version of Nutch/JDK/OS are you using?
2. Do you have any log output that you can show to determine whether the
parse-rss or feed plugin is being called?
3. Have you activated those plugins in your nutch-default.xml conf file?
Let me know on 1-3 and then maybe I can help more.
1) I have Nutch 1.2, Windows 7 Ultimate, and Java 1.6.0_21.
2) I tried with the "regex-urlfilter" file and with this plugin. The log shows:
2010-11-02 20:20:25,694 ERROR api.RegexURLFilterBase - Invalid first character: # Licensed to the Apache Software Foundation (ASF) under one or more
2010-11-02 20:20:25,698 WARN mapred.LocalJobRunner - job_local_0001
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
3) Yes, I have the plugins:
I tried to index these pages:
http://www.edutube.org/en/taxonomy/term/7/feed
http://ocw.mit.edu/rss/all/mit-allcourses-21A.xml
http://cnx.org/lenses/cnxhcc/affiliation/atom
http://www.merlot.org/merlot/materials.xml?category=2788&materialType=&keywords=&qstringrss=category%3D2788%26sort.property%3DoverallRating&sort.property=overallRating&sortbutton.x=18&sortbutton.y=7&sortbutton=Sort
but the result is always the same. I think that I have to configure the
"regex-urlfilter" file so that the links in each feed are indexed but the
main RSS page itself (http://www.edutube.org/en/taxonomy/term/7/feed, for
example) is not. I don't know.
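
One way to narrow down the "Invalid first character" error in the log above is to check the rule file before running the crawl. The sketch below is only an illustration, not part of Nutch. It assumes that each non-empty line of the filter file should start with "+", "-", "#", or a space, and that a UTF-8 byte-order mark left by a Windows editor could be what makes the "# Licensed ..." comment line look invalid. Both points are assumptions drawn from the error message, and the names find_bad_lines and check_rule_file and the path conf/regex-urlfilter.txt are hypothetical.

```python
import codecs

# Assumption: first characters a rule file line is expected to start with
# ("+" include rule, "-" exclude rule, "#" comment, " " ignored line).
ALLOWED_FIRST = ("+", "-", "#", " ")


def find_bad_lines(text):
    """Return (line_number, line) pairs whose first character is unexpected."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # empty lines are skipped
        if line[0] not in ALLOWED_FIRST:
            bad.append((number, line))
    return bad


def check_rule_file(path):
    """Report whether the file starts with a UTF-8 BOM, plus any suspect lines."""
    with open(path, "rb") as handle:
        raw = handle.read()
    has_bom = raw.startswith(codecs.BOM_UTF8)
    # Decoding as plain UTF-8 keeps a BOM as an invisible first character,
    # so a comment line preceded by a BOM is reported as suspect.
    text = raw.decode("utf-8", errors="replace")
    return has_bom, find_bad_lines(text)


if __name__ == "__main__":
    # Hypothetical location of the rule file; adjust to your install.
    bom, suspects = check_rule_file("conf/regex-urlfilter.txt")
    print("UTF-8 BOM present:", bom)
    for number, line in suspects:
        print("line", number, "starts with", repr(line[0]))
```

If the script reports a BOM or a suspect first line, re-saving the file as UTF-8 without a BOM (or as plain ASCII) may be worth trying before changing the rules themselves.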