No, I haven't used Tika. I didn't know that 2.x does not allow multiple docs from a single entry; thanks for pointing that out. I am using 1.5.1 for my current work, where I need multiple docs from an RSS feed, which generally has several outlinks, all in one parse phase of a crawl cycle.

Sourajit

On Mon, Nov 12, 2012 at 2:27 PM, Julien Nioche <[email protected]> wrote:

> Sourajit,
>
> BTW have you tried using the Tika plugin for parsing RSS feeds? The main
> difference IIRC is that it will treat the links to the news items as normal
> outlinks and will then fetch them whereas the feed parser generates N-sub
> documents directly from the feed. This is actually why we don't have it in
> Nutch 2.x as the parsing does not allow to have N docs from a single entry.
>
> Julien
>
> On 9 November 2012 11:34, Sourajit Basak <[email protected]> wrote:
>
> > https://issues.apache.org/jira/browse/NUTCH-1494
> >
> > On Fri, Nov 9, 2012 at 4:56 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Can you please open an issue for this. I can confirm that without
> > > adding some additional dependencies I get the following when
> > > attempting to parse an rss feed [0] which I have saved locally.
> > >
> > > lewis@lewis-desktop:~/ASF/trunk/runtime/local$ ./bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser latest.xml
> > > Exception in thread "main" java.lang.reflect.InvocationTargetException
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >     at java.lang.reflect.Method.invoke(Method.java:597)
> > >     at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> > > Caused by: java.lang.NoClassDefFoundError: com/sun/syndication/io/SyndFeedInput
> > >     at org.apache.nutch.parse.feed.FeedParser.getParse(FeedParser.java:117)
> > >     at org.apache.nutch.parse.feed.FeedParser.main(FeedParser.java:211)
> > >     ... 5 more
> > > Caused by: java.lang.ClassNotFoundException: com.sun.syndication.io.SyndFeedInput
> > >     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> > >     at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> > >     at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
> > >     ... 7 more
> > >
> > > [0] http://www.scotland.gov.uk/rss/publications/latest.xml
> > >
> > > On Fri, Nov 9, 2012 at 10:55 AM, Sourajit Basak <[email protected]> wrote:
> > >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
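
For anyone reading this thread later, the difference Julien describes (N sub-documents from one feed versus plain outlinks that are fetched afterwards) can be illustrated with a rough sketch. This is not the Nutch FeedParser or the Tika plugin: the class FeedSketch and its methods are hypothetical names, the feed content is made up, and only the JDK's built-in XML parser is used instead of the syndication library from the stack trace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not code from Nutch or Tika.
public class FeedSketch {

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    // Feed-parser style: one parse of the feed yields one sub-document
    // (here simply link -> title) per <item> element.
    static Map<String, String> subDocuments(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        Map<String, String> docs = new LinkedHashMap<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            docs.put(childText(item, "link"), childText(item, "title"));
        }
        return docs;
    }

    // Outlink style: the parse yields only the item URLs; their content
    // would come from fetching each URL in a later crawl cycle.
    static List<String> outlinks(String rssXml) throws Exception {
        return new ArrayList<>(subDocuments(rssXml).keySet());
    }

    public static void main(String[] args) throws Exception {
        // Made-up two-item feed used only for demonstration.
        String rss = "<rss version=\"2.0\"><channel><title>Example</title>"
                + "<link>http://example.org/</link>"
                + "<item><title>First</title><link>http://example.org/1</link></item>"
                + "<item><title>Second</title><link>http://example.org/2</link></item>"
                + "</channel></rss>";
        System.out.println(subDocuments(rss));
        System.out.println(outlinks(rss));
    }
}
```

In the first style the titles (and, in a real parser, the item text) are available in the same parse phase, which is the behaviour needed on 1.5.1 here. In the second style only the URLs come out of the parse, and the content arrives after another fetch.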

