RE: Processing Pages in Pairs

Iain Lopata Sat, 29 Nov 2014 14:59:57 -0800

Thanks Markus, your pointers are very helpful.

I have looked at BlockJoins.  Since there is a 1-to-1 relationship between the 
pairs of pages I need to process, I think BlockJoins would add unnecessary 
complexity to the queries. A custom update processor appears to me to be the 
better option.

I have found a couple of useful examples that may help others tackling similar 
problems.

First, I am going to try using the links-extractor indexing plugin found at 
https://github.com/jorgelbg/links-extractor to ensure that I have a reference 
to "Page A" at the time I index "Page B".

Second, I am going to start with the solr-field-update UpdateRequestProcessor 
found at https://github.com/guardian/solr-field-update as a template, but will 
modify the lookup approach to use the inlink from the links extractor.
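
A minimal sketch of the merge step such a processor would perform, written in 
plain Java with maps standing in for documents. It does not use any Solr 
classes. The class name PairMergeSketch, the field name "inlink", the 
"pageA_" prefix and the lookup function are all assumptions made for 
illustration; the real processor would run an index query instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java illustration of merging "Page A" fields into "Page B". */
final class PairMergeSketch {

    // Hypothetical field names, chosen for this example only.
    static final String INLINK_FIELD = "inlink";
    static final String PREFIX = "pageA_";

    /**
     * Returns a copy of pageB with the fields of the page its inlink points
     * to copied in under a prefix. lookupByUrl stands in for whatever index
     * lookup the real update processor would perform.
     */
    static Map<String, Object> mergePair(
            Map<String, Object> pageB,
            Function<String, Map<String, Object>> lookupByUrl) {
        Map<String, Object> merged = new HashMap<>(pageB);
        Object inlink = pageB.get(INLINK_FIELD);
        if (inlink == null) {
            return merged; // no reference to Page A: leave Page B unchanged
        }
        Map<String, Object> pageA = lookupByUrl.apply(inlink.toString());
        if (pageA == null) {
            return merged; // Page A has not been indexed yet
        }
        for (Map.Entry<String, Object> field : pageA.entrySet()) {
            merged.put(PREFIX + field.getKey(), field.getValue());
        }
        return merged;
    }
}
```

Because the relationship is 1-to-1, a single lookup per document is enough, 
which is the main reason this looks simpler than BlockJoins for my case.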

I will still need to build the custom parser for vCard, unless anyone has one 
they can share.  I plan to do this based on ez-vcard found at 
https://code.google.com/p/ez-vcard/wiki/ReadingVCards#3_Differences_between_Ezvcard_and_reader_classes
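
To show roughly what the parser has to extract, here is a plain-Java 
illustration that reads property names and values from vCard text. It is not 
ez-vcard and does not use its API. The class name VCardSketch and the method 
readProperties are made up for this example, and it ignores escaping, quoted 
parameters, property groups and encodings, all of which a real library 
handles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Simplified line-based vCard reader, for illustration only. */
final class VCardSketch {

    /** Maps each property name (upper-cased) to its values, in file order. */
    static Map<String, List<String>> readProperties(String vcardText) {
        // Unfold: a line break followed by a space or tab continues the
        // previous line.
        String unfolded = vcardText.replaceAll("\\r?\\n[ \\t]", "");
        Map<String, List<String>> props = new LinkedHashMap<>();
        for (String line : unfolded.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "NAME:value" line
            }
            String name = line.substring(0, colon);
            int semi = name.indexOf(';');
            if (semi >= 0) {
                name = name.substring(0, semi); // drop parameters like TYPE=work
            }
            name = name.toUpperCase(Locale.ROOT);
            String value = line.substring(colon + 1);
            props.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return props;
    }
}
```

The resulting map could then be copied into document fields by the update 
processor.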

Plenty to do, but I think you have me headed in the right direction - and 
this certainly seems better than hacking the map/reduce processing in the 
Nutch indexer.

Thanks again

-----Original Message-----
From: Markus Jelsma [mailto:[email protected]] 
Sent: Wednesday, November 26, 2014 1:39 PM
To: [email protected]
Subject: RE: Processing Pages in Pairs

Using Solr BlockJoins would probably be the easiest these days unless you 
really need to process them in Nutch. If you still want to process them 
simultaneously you can write a custom Solr UpdateRequestProcessor plugin and 
build the logic there.

-----Original message-----
> From:Lewis John Mcgibbney <[email protected]>
> Sent: Wednesday 26th November 2014 0:10
> To: [email protected]
> Subject: Re: Processing Pages in Pairs
> 
> Hi Iain,
> 
> On Tue, Nov 25, 2014 at 2:44 PM, <[email protected]> wrote:
> 
> >
> >
> > What would you recommend in this situation?  Are there other options 
> > that I am missing?
> 
> 
> I think that our good friend Markus has previously provided some 
> insight into the technical implementation of a task which may be 
> synonymous with what you are trying to achieve.
> http://www.mail-archive.com/user%40nutch.apache.org/msg04695.html
> Sounds pretty hands on to me, it would be difficult to keep your 
> version of Nutch up-to-date with trunk if you were doing that.
> hth
> Lewis
>
