Thanks Gabriele,

I have been meaning to get hands-on and start working on a reliable script
that includes the recently added commands.

This is a starting point for me.

Thank you,
Lewis
________________________________________
From: Gabriele Kahlout [[email protected]]
Sent: 27 March 2011 15:24
To: [email protected]
Cc: McGibbney, Lewis John
Subject: Re: Index while crawling

On Fri, Mar 25, 2011 at 3:44 PM, McGibbney, Lewis John <
[email protected]> wrote:

> Hi Gabriele,
>
> Would it be worth making this script available on the wiki with an
> explanation of exactly what its purpose is, what it does, and a use case?
>
> When I get a chance I will try it out using Solr as indexing mechanism.
>

I've posted the script [1].
It's set up to work with Solr, but I haven't yet posted a Hadoop
edition, since I still need to familiarize myself with it.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script


> Thank you for this
>
> Lewis
> ________________________________________
> From: Gabriele Kahlout [[email protected]]
> Sent: 24 March 2011 15:33
> To: [email protected]
> Cc: McGibbney, Lewis John; [email protected]
> Subject: Re: Index while crawling
>
> It indeed is this way. I guess my options would be:
>
> 1. Use a scoring plugin that assigns links a lower score than the
> initial score, so that with -topN the urls from the urls list are
> retrieved before links added to the db after fetching. My understanding is
> that the OpicScoringFilter right now assigns 0 to start with, so all urls
> are equal and the hashtable works more like a LIFO; hence links are
> crawled before urls in the list.
>
> 2. Include inject in the loop and have the number of urls in the file ==
> topN, so that one iteration is enough for all urls, and then inject again.
> Once the whole list has been fetched this way (with depth=0), one can
> iterate for depth if desired. I guess this solution is also known as
> merging crawls.
>
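
To illustrate option 1 above, here is a toy model of score-ordered top-N selection. It is only a sketch and not Nutch's generator: the function name `select_top_n`, the dict-based crawl db, and the two score constants are assumptions made up for this example.

```python
import heapq

# Hypothetical scores, chosen only for illustration.
SEED_SCORE = 1.0   # assumed initial score for urls from the urls list
LINK_SCORE = 0.5   # assumed lower score for links found while fetching


def select_top_n(crawldb, top_n):
    """Return up to top_n urls from the toy crawl db, highest score first."""
    candidates = [(score, url) for url, score in crawldb.items()]
    best = heapq.nlargest(top_n, candidates)
    return [url for _score, url in best]


# Toy crawl db: two seed urls plus two links discovered after a fetch.
crawldb = {
    "http://seed-a.example/": SEED_SCORE,
    "http://seed-b.example/": SEED_SCORE,
    "http://seed-a.example/link1": LINK_SCORE,
    "http://seed-a.example/link2": LINK_SCORE,
}

# With top_n equal to the number of seeds, only seeds are selected.
print(select_top_n(crawldb, 2))
```

If every entry carried the same score, the selection would depend only on tie-breaking, which is the situation the option is meant to avoid.
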
> I'll be trying 2. Meanwhile I've changed the script to the attached.
>
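
The batching idea in option 2 above can be sketched as follows. This is a minimal outline under stated assumptions, not the attached script: `batches` and `plan` are hypothetical helpers, and the step labels (inject, generate, fetch, updatedb, index) are placeholders for the corresponding stages of the crawl cycle. No Nutch commands are actually run.

```python
def batches(urls, top_n):
    """Split the urls list into consecutive chunks of at most top_n urls."""
    for start in range(0, len(urls), top_n):
        yield urls[start:start + top_n]


def plan(urls, top_n):
    """List the steps of one full round per chunk, without executing anything.

    Each chunk is injected on its own, so a single generate with topN equal
    to the chunk size is enough to select every url in that chunk.
    """
    steps = []
    for i, chunk in enumerate(batches(urls, top_n)):
        steps.append(("inject", i, chunk))     # add only this chunk to the db
        steps.append(("generate", i, top_n))   # select up to top_n urls
        steps.append(("fetch", i))             # fetch the generated segment
        steps.append(("updatedb", i))          # record what was fetched
        steps.append(("index", i))             # placeholder for the indexing step
    return steps


if __name__ == "__main__":
    seed_urls = [f"http://example.org/page{i}" for i in range(5)]
    for step in plan(seed_urls, top_n=2):
        print(step)
```

Iterating for depth, if desired, would happen only after every chunk has gone through one round.
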
>
> Glasgow Caledonian University is a registered Scottish charity, number
> SC021474
>
> Winner: Times Higher Education’s Widening Participation Initiative of the
> Year 2009 and Herald Society’s Education Initiative of the Year 2009.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
>
> Winner: Times Higher Education’s Outstanding Support for Early Career
> Researchers of the Year 2010, GCU as a lead with Universities Scotland
> partners.
>
> http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,15691,en.html
>



--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).