Hi Israel,

> [...]
>
> Can you post your regex-urlfilter.txt file? It seems like there are invalid
> chars in there (i.e., likely something thought to be a comment with a "#" in
> front of it, but not being interpreted as a comment)

I took a look at the file. It looks fine to me. You are on Windows, right?
You may need to install Cygwin; I'm not sure if Nutch works out of the box
on regular Windows.

>
>>
>> 3) Yes, I have the plugins:
>> I tried to index this pages:
>>
>> [...]
> Can you show me in your nutch-default.xml where you activate the parse-rss and
> feed plugin? They need to be activated in order for them to parse the RSS
> content you've mentioned in your URLs above.

I looked in your nutch-default.xml at the plugin.includes property. You
need to turn on both parse-rss and feed. Try changing the value from:

  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

to:

  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|feed|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

That will activate the plugins. Note that the value must stay on a single
line with no spaces inside it.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
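For anyone who wants to sanity-check a plugin.includes value before re-running a crawl, here is a rough sketch in Python. It assumes the value is treated as a regular expression that has to match a plugin's whole id, and it uses Python's `re` module as a stand-in for the Java regex engine that Nutch runs on, so treat it as an illustration rather than as Nutch's own logic. The helper names `is_included` and `missing_plugins` are made up for this example.

```python
import re

# Both values are copied from the message above, with the mail client's
# line-wrap breaks removed so each is a single unbroken string.
OLD_VALUE = (
    "protocol-http|urlfilter-regex|parse-(text|html|js|tika)|"
    "index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|"
    "summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)
NEW_VALUE = (
    "protocol-http|urlfilter-regex|feed|parse-(rss|text|html|js|tika)|"
    "index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|"
    "summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, includes_value):
    """Return True if the whole plugin id matches the includes regex.

    Assumption: whole-string matching, which is how this sketch models
    the plugin.includes property.
    """
    return re.fullmatch(includes_value, plugin_id) is not None


def missing_plugins(includes_value, wanted):
    """List the wanted plugin ids that the includes regex does not cover."""
    return [p for p in wanted if not is_included(p, includes_value)]


if __name__ == "__main__":
    wanted = ["feed", "parse-rss"]
    print(missing_plugins(OLD_VALUE, wanted))  # ['feed', 'parse-rss']
    print(missing_plugins(NEW_VALUE, wanted))  # []
```

With the old value both RSS-related plugins come back as missing; with the new value the list is empty, which is what you want to see before crawling the feed URLs again.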

