Hi Israel,

OK, I checked out your link. Let me see if I can summarize what you'd like 
Nutch to do:

1. you're using parse-rss
2. you notice that parse-rss adds *both* the RSS file link *and* the inner 
links within the RSS file to the search results
3. instead of doing #2, you'd like parse-rss to *only* add the links within the 
RSS files
4. to achieve #3, you're trying to tweak the regex-urlfilter rules

Is 1-4 a good summary of the issue? If so, then here's the larger problem. 
Filtering RSS feeds with regex-urlfilter is difficult because you can't 
reliably pick out feeds based on a .rss extension alone: a lot of these feeds 
are generated by web services (including those in your example), so their 
URLs often don't end in a feed extension at all.
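
To illustrate why an extension-based rule falls short, here is a minimal Java 
sketch. The pattern and the URLs are made up for illustration (they are not 
from your config or your example), and it uses plain java.util.regex rather 
than Nutch's own filter plugin:

```java
import java.util.regex.Pattern;

public class ExtensionFilterDemo {
    // Hypothetical rule: reject any URL that ends in .rss or .xml.
    private static final Pattern FEED_EXTENSION =
        Pattern.compile("\\.(rss|xml)$", Pattern.CASE_INSENSITIVE);

    // Returns true when the extension-based rule would reject the URL.
    static boolean rejectedByExtensionRule(String url) {
        return FEED_EXTENSION.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            // Static feed file: the extension gives it away.
            "http://example.com/news/feed.rss",
            // Feed generated by a web service: nothing at the end to match.
            "http://example.com/feeds?format=rss&topic=nutch"
        };
        for (String url : urls) {
            String verdict = rejectedByExtensionRule(url) ? "rejected" : "accepted";
            System.out.println(verdict + " " + url);
        }
    }
}
```

The second URL slips through even though it serves a feed, which is why a 
URL pattern alone is unlikely to do what you want here.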

So, you may need to modify parse-rss. Check out 
$NUTCH_HOME/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
 and look for the line:

    if (r.getLink() != null) {
        try {
            //...

Right underneath there, the code adds the link for the RSS channel itself 
(which is usually the link to the RSS file). I think if you remove that, and 
then recompile the plugin (by going to $NUTCH_HOME/src/plugin/parse-rss/ and 
typing ant clean ; ant), you can try that out.
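
As a rough illustration of the intended effect, here is a self-contained Java 
sketch. It is not the actual RSSParser code: the class, the method, and the 
includeChannelLink flag are all hypothetical, and the links are placeholders. 
It only shows the difference between keeping and dropping the channel link:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ItemLinksOnlyDemo {
    // Hypothetical helper: gathers the links a feed would contribute.
    // With includeChannelLink == false, only the item links are kept.
    static List<String> collectLinks(String channelLink,
                                     List<String> itemLinks,
                                     boolean includeChannelLink) {
        List<String> links = new ArrayList<>();
        if (includeChannelLink && channelLink != null) {
            links.add(channelLink);   // the link to the channel itself
        }
        for (String link : itemLinks) {
            if (link != null) {
                links.add(link);      // the links inside the feed
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String channel = "http://example.com/feed";
        List<String> items = Arrays.asList(
            "http://example.com/story/1",
            "http://example.com/story/2");

        System.out.println(collectLinks(channel, items, true));
        System.out.println(collectLinks(channel, items, false));
    }
}
```

The second call corresponds to what you'd want after removing the channel 
link from the plugin: only the item links remain.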

HTH!

Cheers,
Chris


On Nov 19, 2010, at 6:56 AM, Israel wrote:

> Hello, I did this to explain my problem, please help me:
> 
> http://www.box.net/shared/qyqnk38x25


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
