Re: Problem compiling FeedParser plugin with Nutch 2.1 source

Jorge Luis Betancourt Gonzalez Sat, 02 Mar 2013 20:59:00 -0800

My bad, I thought that FeedParser referred to the Nutch 1.x plugin found in the 
svn trunk :-) I'm asking because I'm trying to parse a bunch of PDF files but 
store each page individually, i.e. a 1-to-N relation between one URL and many 
documents in Solr. Any advice on this subject? Right now I'm thinking of 
something like 
http://sujitpal.blogspot.com/2012/02/nutchgora-indexing-sections-and.html.


What do you think?
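
To make the 1-to-N idea concrete, here is a minimal sketch in plain Java of how one URL could fan out into one document per page. It assumes the page texts have already been extracted from the PDF by some other step. The class name PageSplitter and the field names (id, parent_url, page, content) are hypothetical and are not part of the Nutch or Solr APIs; the only real requirement it illustrates is that every sub-document gets its own unique key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Nutch or Solr: one URL in, N page documents out. */
class PageSplitter {

    /**
     * Builds one field map per page of a single fetched URL.
     * Field names are assumptions for illustration, not an existing schema.
     */
    static List<Map<String, Object>> toPageDocs(String url, List<String> pageTexts) {
        List<Map<String, Object>> docs = new ArrayList<Map<String, Object>>();
        for (int i = 0; i < pageTexts.size(); i++) {
            int pageNumber = i + 1; // pages are numbered from 1
            Map<String, Object> doc = new LinkedHashMap<String, Object>();
            // Each sub-document needs its own unique key, derived here from the parent URL.
            doc.put("id", url + "#page=" + pageNumber);
            // Keep a pointer back to the parent so results can be grouped per URL.
            doc.put("parent_url", url);
            doc.put("page", pageNumber);
            doc.put("content", pageTexts.get(i));
            docs.add(doc);
        }
        return docs;
    }
}
```

Calling PageSplitter.toPageDocs with a URL and a list of three page texts would give three maps whose ids end in #page=1, #page=2 and #page=3; each map would then be sent to the index as a separate document.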

----- Original Message -----
From: "Lewis John Mcgibbney" <[email protected]>
To: [email protected]
Sent: Saturday, 2 March 2013 23:16:23
Subject: Re: Problem compiling FeedParser plugin with Nutch 2.1 source

Hi Jorge,
Afaik it isn't.
We're talking about 2.x here.

On Saturday, March 2, 2013, Jorge Luis Betancourt Gonzalez <
[email protected]> wrote:
> How do the subdocuments get indexed into Solr? I thought that 1-to-N
> wasn't possible with Nutch 1.x.
>
> ----- Original Message -----
> From: "Julien Nioche" <[email protected]>
> To: [email protected]
> Sent: Saturday, 2 March 2013 3:27:35
> Subject: Re: Problem compiling FeedParser plugin with Nutch 2.1 source
>
> IIRC the FeedParser creates sub-documents from the main feed document
> parsed (1 to N), whereas Tika just treats them as new links and does the
> fetch + parse in a subsequent step.
>
> It is because Nutch 2.x does not support 1-to-N parse outputs that this
> plugin hasn't been ported. I don't remember the exact history of this
> plugin as it was in the code long before I got involved but it would be
> good to get to the bottom of how it differs from parsing feeds with Tika
> and decide whether it still makes sense to have it or not.
>
> J.
>
>
>
> On 1 March 2013 04:51, Anand Bhagwat <[email protected]> wrote:
>
>> Thanks for the quick reply.
>>
>> Actually I needed a plugin for ATOM feed parsing, so while searching in
>> the source I found FeedParser, but it was giving compilation errors. Later
>> I tried the Tika parser and was able to parse an ATOM feed. I am not sure
>> if I am missing something. Basically the Tika parser extracted URLs and
>> created new entries in the database, and later when I ran the fetch job
>> again I was able to fetch those URLs.
>>
>> So the question is: does FeedParser provide some additional functionality
>> which is missing in the Tika parser? As far as I know the Tika parser uses
>> ROME, which is a well-known library for parsing feeds.
>>
>> Regards,
>> Anand.
>>
>> On 1 March 2013 03:38, kiran chitturi <[email protected]> wrote:
>>
>> > Lewis,
>> >
>> > On the same note, I found that the following plugins need to be ported
>> > when I tried to build 2.x with Eclipse:
>> >
>> > i)   Feed
>> > ii)  parse-swf
>> > iii) parse-ext
>> > iv) parse-zip
>> > v) parse-metatags (I wrote a patch for this earlier, NUTCH-1478)
>> >
>> > The above plugins need to be ported to build 2.x successfully with
>> > plugins.
>> >
>> >
>> >
>> > On Thu, Feb 28, 2013 at 4:58 PM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> > > Honestly, I think we should get this fixed.
>> > > Can someone please explain to me why we don't build every plugin
>> > > within Nutch 2.x?
>> > > I think we should.
>> > >
>> > >
>> > > On Thu, Feb 28, 2013 at 12:58 PM, kiran chitturi
>> > > <[email protected]> wrote:
>> > >
>> > > > This is a problem with the feed plugin. It is not yet ported to
>> > > > 2.x.
>> > > >
>> > > > The FeedIndexingFilter class implements the IndexingFilter
>> > > > interface, whose methods changed from 1.x to 2.x.
>> > > >
>> > > > I fixed a similar one in parse-metatags, which implements the
>> > > > ParseFilter interface.
>> > > >
>> > > > [NUTCH-874] was opened for these issues, but we still do not know
>> > > > which plugins need to be ported due to the API changes.
>> > > >
>> > > > https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Feb 28, 2013 at 3:26 PM, Lewis John Mcgibbney <
>> > > > [email protected]> wrote:
>> > > >
>> > > > > This shouldn't be happening, but we are aware (the Jira instance
>> > > > > reflects this) that there are some existing compatibility issues
>> > > > > with Nutch 2.x HEAD.
>> > > > > IIRC Kiran had a patch integrated which dealt with some of these
>> > > > > issues.
>> > > > > What I have to ask is: what JDK are you using? I use 1.6.0_25 (I
>> > > > > really need to upgrade) on my laptop.

--
*Lewis*
