On Thu, Mar 24, 2011 at 4:33 PM, Gabriele Kahlout
<[email protected]> wrote:

>
>
> On Thu, Mar 24, 2011 at 1:46 PM, Gabriele Kahlout <
> [email protected]> wrote:
>
>>
>>
>> On Thu, Mar 24, 2011 at 1:36 PM, McGibbney, Lewis John <
>> [email protected]> wrote:
>>
>>> Hi Gabriele,
>>>
>>> Out of curiosity, how large is your crawl job? How many URLs are you
>>> fetching on each increment? Is it a continuous crawl job?
>>>
>>
>> I guess the -topN 1 triggered your interest. I was fetching only one local
>> page for testing. Now I'm testing a crawl of Simple Wikipedia with -topN
>> 100. I'm also trying to figure out whether my $3 represents the depth of
>> the crawl or not.
>> It certainly does if the number of urls is <= -topN, but for what I'm trying
>> to do (incremental crawling) I'd like all injected urls to be fetched, in
>> topN increments, rather than have the fetcher start on urls found in the
>> previous iteration's topN urls.
>>
>
> It is indeed this way. I guess my options would be:
>
> 1. Use a scoring plugin that assigns links a lower score than the
> initial score, so that with -topN the urls from the urls list are retrieved
> before links added to the db after fetching. My understanding is that
> the OpicScoringFilter right now assigns 0 to start with, so all urls are
> equal and the hashtable works more like a LIFO; hence links are crawled
> before urls in the list.
>
> Essentially I seconded the thoughts of Julien and Ken here:
> <http://search-lucene.com/m/Fi4T8jJiQS&subj=Re+How+to+prioritize+the+fetching+of+outlinks+>

My objection to this approach, however, is that one modifies the score of a
page just to influence Nutch's fetching speed/priority, while that has nothing
to do with the page's 'effective' score.


> 2. Include inject in the loop and have the number of urls in the file ==
> topN, so that one iteration is enough for all urls, and then inject again.
> Once the whole list has been fetched this way (with depth=0) one can iterate
> for depth if desired. I guess this solution is also known as merging crawls.
>
> I'll be trying 2. Meanwhile I've changed the script to the attached.
>
I've had an issue <https://issues.apache.org/jira/browse/NUTCH-971> merging a
merged index with another index. Other than applying the patch, the
workaround is to append part-1 to the output index:
$ bin/nutch merge crawl/temp_indexes/*part-1* crawl/indexes crawl/new_indexes

I'll contribute the script to the wiki once done with it.
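
In the meantime, here is a rough sketch of what option 2 could look like. It
is not the attached script: crawl_in_batches and run_nutch are hypothetical
helper names, the crawl/ layout is assumed, and it prints the commands
instead of running them unless DRY_RUN=0. Whether generate really picks the
freshly injected urls ahead of outlinks from earlier batches still depends on
the scoring question above, so treat it as a starting point only.

```shell
#!/usr/bin/env bash
# Sketch of option 2: inject the seed list in chunks of topN urls and run
# one generate/fetch/updatedb round per chunk.
# run_nutch and crawl_in_batches are hypothetical helpers, not part of Nutch.

# Print the command instead of executing it unless DRY_RUN=0.
run_nutch() {
    if [[ "${DRY_RUN:-1}" -eq 1 ]]; then
        echo "bin/nutch $*"
    else
        bin/nutch "$@"
    fi
}

# Usage: crawl_in_batches <url_file> <topN> [crawl_dir]
crawl_in_batches() {
    local url_file=$1 topN=$2 crawl_dir=${3:-crawl}
    local work batch segment n=0
    work=$(mktemp -d) || return 1

    # Cut the seed list into files of at most topN lines each.
    split -l "$topN" "$url_file" "$work/batch."

    for batch in "$work"/batch.*; do
        [[ -e "$batch" ]] || continue
        n=$((n + 1))
        # inject takes a directory of seed files, so each batch gets its own.
        mkdir -p "$work/seed.$n"
        mv "$batch" "$work/seed.$n/urls.txt"

        run_nutch inject "$crawl_dir/crawldb" "$work/seed.$n"
        run_nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$topN"

        # Newest segment directory; a placeholder is shown in a dry run,
        # where no segment has actually been generated.
        segment=$(ls -d "$crawl_dir"/segments/* 2>/dev/null | tail -n 1)
        segment=${segment:-$crawl_dir/segments/NEWEST}

        # Depending on the Nutch version/config a separate parse step
        # may be needed between fetch and updatedb.
        run_nutch fetch "$segment"
        run_nutch updatedb "$crawl_dir/crawldb" "$segment"
    done

    rm -rf "$work"
}
```

For example, crawl_in_batches urls/seeds.txt 100 would print the commands
for one round per 100 urls in the (assumed) file urls/seeds.txt.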





>
>>
>>
>>>
>>> Lewis
>>> ________________________________________
>>> From: Gabriele Kahlout [[email protected]]
>>> Sent: 24 March 2011 12:30
>>> To: [email protected]
>>> Cc: [email protected]; Claudio Martella; [email protected]
>>> Subject: Re: Index while crawling
>>>
>>> This seems to work.
>>>
>>> i=0
>>> while true;
>>> do
>>>    if [[ $i -ge $3 ]]
>>>
>>>
>>
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email
>> does not contain a valid code then the email is not received. A valid code
>> starts with a hyphen and ends with "X".
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> L(-[a-z]+[0-9]X)).
>>
>>


-- 
Regards,
K. Gabriele

