No. I haven't used Tika.

I didn't know that 2.x does not allow multiple docs from a single entry;
it's good that you pointed that out. I am using 1.5.1 for my current work,
where I need multiple docs from an RSS feed, which generally has several
outlinks, all in one parse phase of a crawl cycle.

Sourajit

On Mon, Nov 12, 2012 at 2:27 PM, Julien Nioche <
[email protected]> wrote:

> Sourajit,
>
> BTW have you tried using the Tika plugin for parsing RSS feeds? The main
> difference IIRC is that it will treat the links to the news items as normal
> outlinks and will then fetch them, whereas the feed parser generates N
> sub-documents directly from the feed. This is actually why we don't have it in
> Nutch 2.x, as the parsing there does not allow N docs from a single entry.
>
> Julien
>
> On 9 November 2012 11:34, Sourajit Basak <[email protected]> wrote:
>
> > https://issues.apache.org/jira/browse/NUTCH-1494
> >
> > On Fri, Nov 9, 2012 at 4:56 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > Can you please open an issue for this? I can confirm that, without
> > > adding some additional dependencies, I get the following when
> > > attempting to parse an RSS feed [0] which I have saved locally.
> > >
> > > lewis@lewis-desktop:~/ASF/trunk/runtime/local$ ./bin/nutch plugin feed org.apache.nutch.parse.feed.FeedParser latest.xml
> > > Exception in thread "main" java.lang.reflect.InvocationTargetException
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >         at java.lang.reflect.Method.invoke(Method.java:597)
> > >         at org.apache.nutch.plugin.PluginRepository.main(PluginRepository.java:421)
> > > Caused by: java.lang.NoClassDefFoundError: com/sun/syndication/io/SyndFeedInput
> > >         at org.apache.nutch.parse.feed.FeedParser.getParse(FeedParser.java:117)
> > >         at org.apache.nutch.parse.feed.FeedParser.main(FeedParser.java:211)
> > >         ... 5 more
> > > Caused by: java.lang.ClassNotFoundException: com.sun.syndication.io.SyndFeedInput
> > >         at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> > >         at java.security.AccessController.doPrivileged(Native Method)
> > >         at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> > >         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
> > >         at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
> > >         ... 7 more
> > >
> > >
> > > [0] http://www.scotland.gov.uk/rss/publications/latest.xml
> > >
> > >
> > >
> > > On Fri, Nov 9, 2012 at 10:55 AM, Sourajit Basak
> > > <[email protected]> wrote:
> > >
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
