Hi Israel,

> [...]
> 
> Can you post your regex-urlfilter.txt file? It seems like there are invalid
> chars in there (i.e., likely something thought to be a comment with a "#" in
> front of it, but not being interpreted as a comment)

I took a look at the file, and it looks fine to me. You are on Windows,
right? You may need to install Cygwin; I'm not sure whether Nutch works out
of the box on regular Windows.

> 
>> 
>> 3) Yes, I have the plugins:
>>      I tried to index this pages:
>>
>> [...]
> Can you show me in your nutch-default.xml where you activate the parse-rss
> and feed plugins? They need to be activated in order for them to parse the
> RSS content you've mentioned in your URLs above.

I looked at the plugin.includes property in your nutch-default.xml. You
need to turn on both parse-rss and feed. Try changing the value from:

 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>



to:

 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-regex|feed|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

That should activate both plugins.
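
If you want to double-check the edited value before re-running the crawl,
here is a rough Python sketch. It assumes that Nutch treats plugin.includes
as a regular expression that has to match a plugin's whole id, which is my
understanding but worth confirming against your Nutch version. The helper
names (plugin_includes, is_enabled) and the inline SAMPLE config are made up
for illustration; in practice you would read your own config file instead.

```python
import re
import xml.etree.ElementTree as ET


def plugin_includes(config_xml: str) -> str:
    """Return the plugin.includes value from a Hadoop-style config document."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            # Drop all whitespace so a value wrapped across lines is rejoined.
            return "".join(prop.findtext("value", "").split())
    raise KeyError("plugin.includes not found")


def is_enabled(includes: str, plugin_id: str) -> bool:
    # Assumption: the expression must match the entire plugin id.
    return re.fullmatch(includes, plugin_id) is not None


# Hypothetical stand-in for the relevant part of your config file.
SAMPLE = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|feed|parse-(rss|text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    includes = plugin_includes(SAMPLE)
    for plugin in ("feed", "parse-rss"):
        print(plugin, is_enabled(includes, plugin))
```

With the original value neither id matches, because "feed" is not listed
and "rss" is missing from the parse-(...) group.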

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

