IIRC the FeedParser creates sub-documents from the main feed document it
parses (1 to N), whereas Tika just treats the entries as new links and does
the fetch + parse in a subsequent step.

It is because Nutch 2.x does not support 1-to-N parse outputs that this
plugin hasn't been ported. I don't remember the exact history of this
plugin, as it was in the code long before I got involved, but it would be
good to get to the bottom of how it differs from parsing feeds with Tika
and decide whether it still makes sense to keep it.
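
A minimal sketch of the difference, in plain Java. It deliberately does not
use the real Nutch, Tika, or ROME APIs: FeedEntry, FeedParseSketch,
parseOneToN and parseAsOutlinks are hypothetical names made up for
illustration, and the behaviour only mirrors the recollection above, not the
actual plugin code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for one feed entry; not a Nutch, Tika, or ROME class. */
final class FeedEntry {
    final String url;
    final String title;

    FeedEntry(String url, String title) {
        this.url = url;
        this.title = title;
    }
}

final class FeedParseSketch {

    /**
     * 1-to-N style (FeedParser, as recalled above): one parse of the feed
     * returns a document for the feed itself plus one per entry.
     * The map goes from URL to a placeholder for the parsed content.
     */
    static Map<String, String> parseOneToN(String feedUrl, List<FeedEntry> entries) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put(feedUrl, "feed with " + entries.size() + " entries");
        for (FeedEntry entry : entries) {
            docs.put(entry.url, entry.title);
        }
        return docs;
    }

    /**
     * Outlink style (Tika, as recalled above): the parse returns only the
     * entry URLs as new links; each is fetched and parsed in a later round.
     */
    static List<String> parseAsOutlinks(List<FeedEntry> entries) {
        List<String> outlinks = new ArrayList<>();
        for (FeedEntry entry : entries) {
            outlinks.add(entry.url);
        }
        return outlinks;
    }

    public static void main(String[] args) {
        List<FeedEntry> entries = new ArrayList<>();
        entries.add(new FeedEntry("http://example.com/a", "Entry A"));
        entries.add(new FeedEntry("http://example.com/b", "Entry B"));

        System.out.println(parseOneToN("http://example.com/feed.atom", entries).keySet());
        System.out.println(parseAsOutlinks(entries));
    }
}
```

In the first case a single parse step has to emit several documents, which
is the 1-to-N output that 2.x lacks. In the second case the parse emits one
document plus links, so the entries only appear after another fetch round.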

J.



On 1 March 2013 04:51, Anand Bhagwat <[email protected]> wrote:

> Thanks for the quick reply.
>
> Actually I needed a plugin for Atom feed parsing, so while searching in
> the source I found FeedParser, but it was giving compilation errors. Later I
> tried the Tika parser and was able to parse an Atom feed. I am not sure if I
> am missing something. Basically the Tika parser extracted URLs and created
> new entries in the database, and later when I ran the fetch job again I was
> able to fetch those URLs.
>
> So the question is: does FeedParser provide some additional functionality
> that is missing from the Tika parser? As far as I know, the Tika parser uses
> ROME, which is a well-known library for parsing feeds.
>
> Regards,
> Anand.
>
> On 1 March 2013 03:38, kiran chitturi <[email protected]> wrote:
>
> > Lewis,
> >
> > On the same note, when I tried to build 2.x with Eclipse I found that the
> > following plugins need to be ported:
> >
> > i)   Feed
> > ii)  parse-swf
> > iii) parse-ext
> > iv) parse-zip
> > v) parse-metatags (I wrote a patch for this earlier, NUTCH-1478)
> >
> > The above plugins need to be ported to build 2.x successfully with
> > plugins.
> >
> >
> >
> > On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney <
> > [email protected]> wrote:
> >
> > > Honestly, I think we should get this fixed.
> > > Can someone please explain to me why we don't build every plugin within
> > > Nutch 2.x?
> > > I think we should.
> > >
> > >
> > > On Thu, Feb 28, 2013 at 12:58 PM, kiran chitturi
> > > <[email protected]> wrote:
> > >
> > > > This is a problem with the feed plugin. It is not yet ported to 2.x.
> > > >
> > > > The FeedIndexingFilter class implements IndexingFilter, whose
> > > > interface and method signatures changed from 1.x to 2.x.
> > > >
> > > > I fixed a similar one in parse-metatags, which implements the
> > > > ParseFilter interface.
> > > >
> > > > [NUTCH-874] was opened for these issues, but we still do not know
> > > > which plugins need to be ported due to the API changes.
> > > >
> > > >
> > > >
> > > > https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > > >
> > > >
> > > >
> > > > On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney <
> > > > [email protected]> wrote:
> > > >
> > > > > This shouldn't be happening but we are aware (the Jira instance
> > > > > reflects this) that there are some existing compatibility issues
> > > > > with Nutch 2.x HEAD.
> > > > > IIRC Kiran had a patch integrated which dealt with some of these
> > > > > issues.
> > > > > What I have to ask is what JDK are you using? I use 1.6.0_25 (I
> > > > > really need to upgrade) on my laptop and we run the Apache Nutch
> > > > > nightly builds for both 1.x trunk and 2.x branch on the latest 1.7
> > > > > version of Java.
> > > > > Unless I have broken my code whilst writing some patches, my code
> > > > > compiles flawlessly locally and as a project we do not have regular
> > > > > compiler issues with our development nightly builds.
> > > > >
> > > > > On Wed, Feb 27, 2013 at 10:15 PM, Anand Bhagwat
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I want to use the FeedParser plugin which comes as part of the
> > > > > > Nutch 2.1 distribution. When I try to build it, it gives
> > > > > > compilation errors. I think it's using some classes from Nutch 1.6
> > > > > > which are not available. Any suggestions as to how I can resolve
> > > > > > this issue?
> > > > > >
> > > > > >     [javac] /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:28: cannot find symbol
> > > > > >     [javac] symbol  : class CrawlDatum
> > > > > >     [javac] location: package org.apache.nutch.crawl
> > > > > >     [javac] import org.apache.nutch.crawl.CrawlDatum;
> > > > > >     [javac]                              ^
> > > > > >     [javac] /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:29: cannot find symbol
> > > > > >     [javac] symbol  : class Inlinks
> > > > > >     [javac] location: package org.apache.nutch.crawl
> > > > > >     [javac] import org.apache.nutch.crawl.Inlinks;
> > > > > >     [javac]                              ^
> > > > > >     [javac] /home/adminibm/Documents/workspace-sts-3.1.0.RELEASE/nutch2/src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java:36: cannot find symbol
> > > > > >     [javac] symbol  : class ParseData
> > > > > >     [javac] location: package org.apache.nutch.parse
> > > > > >     [javac] import org.apache.nutch.parse.ParseData;
> > > > > >     [javac]                              ^
> > > > > >
> > > > > > Thanks,
> > > > > > Anand.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > *Lewis*
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Kiran Chitturi
> > > >
> > >
> > >
> > >
> > > --
> > > *Lewis*
> > >
> >
> >
> >
> > --
> > Kiran Chitturi
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
