Hi Israel,

Comments inline below:

> 2) I tried with "regex-urlfilter" file and with this plugin
>
> 2010-11-02 20:20:25,694 ERROR api.RegexURLFilterBase - Invalid first character: # Licensed to the Apache Software Foundation (ASF) under one or more
> 2010-11-02 20:20:25,698 WARN mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: Error in configuring object
>   at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>   at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>   at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

Can you post your regex-urlfilter.txt file? It looks like there are invalid
characters in it: most likely a line that was meant to be a comment, with a
"#" in front of it, is not being interpreted as a comment.

>
> 3) Yes, I have the plugins:
> I tried to index this pages:
>
> http://www.edutube.org/en/taxonomy/term/7/feed
> http://ocw.mit.edu/rss/all/mit-allcourses-21A.xml
> http://cnx.org/lenses/cnxhcc/affiliation/atom
>
> http://www.merlot.org/merlot/materials.xml?category=2788&materialType=&keywords=&qstringrss=category%3D2788%26sort.property%3DoverallRating&sort.property=overallRating&sortbutton.x=18&sortbutton.y=7&sortbutton=Sort
>
> but always is the same.. I thing that I have to configure the
> "regex-urlfilter" file..... for index the page links, but not index the rss
> main page: http://www.edutube.org/en/taxonomy/term/7/feed for expample.... i
> don't know

Can you show me where in your nutch-default.xml you activate the parse-rss
and feed plugins? They need to be activated in order to parse the RSS
content at the URLs you mention above.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
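
P.S. In case it helps narrow down the first error, here is a rough Python
sketch that flags lines in a filter file whose first character is not one
the regex filter appears to accept. The allowed set ("+", "-", "#", space)
is my assumption based on the "Invalid first character" message, not
something checked against your Nutch version, and find_bad_lines is just a
hypothetical helper name. An invisible character in front of the "#", such
as a byte-order mark, would show up in its output.

```python
# Assumption: a rule line must start with '+' or '-', and lines starting
# with '#' or a space are skipped. Anything else is reported.
ALLOWED_FIRST = {"+", "-", "#", " "}


def find_bad_lines(text):
    """Return (line_number, first_character) for each suspicious line."""
    bad = []
    for number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")  # tolerate Windows line endings
        if not line:
            continue  # empty lines are ignored
        if line[0] not in ALLOWED_FIRST:
            bad.append((number, line[0]))
    return bad


# Example use (the path is a placeholder for your own file):
#   with open("conf/regex-urlfilter.txt", encoding="utf-8") as handle:
#       for number, first in find_bad_lines(handle.read()):
#           print(number, repr(first))
```

Reading the file as plain "utf-8" rather than "utf-8-sig" is deliberate: it
keeps a leading byte-order mark visible instead of silently stripping it.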

