Hi Israel,

Comments inline below:

> 2) I tried with "regex-urlfilter" file and with this plugin
> 
>    2010-11-02 20:20:25,694 ERROR api.RegexURLFilterBase - Invalid first
> character: # Licensed to the Apache Software Foundation (ASF) under one or
> more
> 2010-11-02 20:20:25,698 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.RuntimeException: Error in configuring object
>     at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>     at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>     at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)

Can you post your regex-urlfilter.txt file? It looks like it contains invalid
characters: most likely a line that was meant to be a comment (starting with
"#") is not being interpreted as one.
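
In the meantime, here is a rough way to check the file yourself. This is only
a sketch, not Nutch code: it assumes (from my recollection of
RegexURLFilterBase) that each non-empty line must start with "+", "-", "#" or
a space, and the sample text with a byte-order mark in front of the "#" is a
made-up illustration of one possible cause, not what is actually in your file.

```python
# Hypothetical helper, not part of Nutch. Assumption: a rule line must start
# with '+' or '-', and lines starting with '#', ' ' or a newline are skipped.
ALLOWED_FIRST = {"+", "-", "#", " ", "\n"}


def find_suspect_lines(text):
    """Return (line_number, line) pairs whose first character is not allowed."""
    suspects = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line:
            continue  # empty lines are harmless
        if line[0] not in ALLOWED_FIRST:
            suspects.append((lineno, line))
    return suspects


# Made-up sample: an invisible byte-order mark (U+FEFF) sits before the "#",
# so the first character of the line is not "#" even though it looks like it.
SAMPLE = (
    "\ufeff# Licensed to the Apache Software Foundation (ASF) under one or more\n"
    "+^http://example.org/\n"
    "-.\n"
)

if __name__ == "__main__":
    for lineno, line in find_suspect_lines(SAMPLE):
        # repr() makes invisible characters such as a BOM show up as escapes
        print(lineno, repr(line))
```

To run it against your own file, read the file contents into a string and
pass that to find_suspect_lines() in place of SAMPLE.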

> 
> 3) Yes, I have the plugins:
>      I tried to index these pages:
> 
>         http://www.edutube.org/en/taxonomy/term/7/feed
>         http://ocw.mit.edu/rss/all/mit-allcourses-21A.xml
>         http://cnx.org/lenses/cnxhcc/affiliation/atom
> 
> http://www.merlot.org/merlot/materials.xml?category=2788&materialType=&keyword
> s=&qstringrss=category%3D2788%26sort.property%3DoverallRating&sort.property=ov
> erallRating&sortbutton.x=18&sortbutton.y=7&sortbutton=Sort
> 
>     but the result is always the same. I think that I have to configure the
> "regex-urlfilter" file so that the page links are indexed but the main RSS
> page itself is not, for example http://www.edutube.org/en/taxonomy/term/7/feed.
> I don't know.

Can you show me where in your nutch-default.xml you activate the parse-rss and
feed plugins (the plugin.includes property)? They need to be activated in
order to parse the RSS content at the URLs you've mentioned above.
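
If it helps, something like the following can tell you whether a given plugin
id is covered by your plugin.includes value. Again, this is a sketch under
assumptions: the helper names are mine, the sample configuration is invented
for illustration, and it assumes the value is a regular expression that has to
match the whole plugin id (Python's regex syntax is close enough to Java's for
simple values like this one, but not identical).

```python
import re
import xml.etree.ElementTree as ET

# Invented sample configuration, NOT your actual settings.
SAMPLE_CONF = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|rss)|feed|index-basic</value>
  </property>
</configuration>
"""


def plugin_includes(xml_text):
    """Return the plugin.includes value from a Hadoop-style config, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    return None


def is_enabled(includes_value, plugin_id):
    """True if the whole plugin id matches the plugin.includes expression."""
    if includes_value is None:
        return False
    return re.fullmatch(includes_value, plugin_id) is not None


if __name__ == "__main__":
    value = plugin_includes(SAMPLE_CONF)
    for plugin_id in ("parse-rss", "feed"):
        print(plugin_id, is_enabled(value, plugin_id))
```

Note that a plain substring search for "parse-rss" would miss a value written
as parse-(text|html|rss), which is why the sketch matches the expression
against the plugin id instead.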

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

