This seems to do what I want:
while true; do
    cmd="bin/nutch generate crawl/crawldb crawl/segments -topN 200"
    echo "$cmd"
    output=$($cmd)
    echo "$output"
    if [[ $output == *'0 records selected for fetching'* ]]; then
        break
    fi

    # newest segment directory
    s1=$(ls -d crawl/segments/2* | tail -1)
    echo "$s1"

    cmd="bin/nutch fetch $s1"
    echo "$cmd"
    $cmd

    cmd="bin/nutch updatedb crawl/crawldb $s1"
    echo "$cmd"
    $cmd

    cmd="bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
    echo "$cmd"
    $cmd

    cmd="bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*"
    echo "$cmd"
    $cmd
    echo
done
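The termination test is a plain bash substring match on the Generator output. A minimal, self-contained sketch of just that test (using a hard-coded sample line in place of a real `bin/nutch generate` run; the exact wording of the log message is an assumption here):

```shell
# Simulated Generator output; a real run would capture it with:
#   output=$(bin/nutch generate crawl/crawldb crawl/segments -topN 200)
output='Generator: 0 records selected for fetching, exiting ...'

# Bash pattern match: true if the marker text appears anywhere in $output.
if [[ $output == *'0 records selected for fetching'* ]]; then
    echo "crawl complete"
else
    echo "more to fetch"
fi
```

Matching on the unquoted pattern with surrounding `*` is what makes this a substring test rather than an exact comparison.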
On Tue, Mar 22, 2011 at 6:01 PM, Gabriele Kahlout
<[email protected]> wrote:
> Okay, and what about the loop termination condition? If I'm crawling an
> unlimited domain (the web), then depth is probably a good option, as
> described on the wiki <http://wiki.apache.org/nutch/IntranetRecrawl> and
> on
> so.com<http://stackoverflow.com/questions/2537874/nutch-how-to-crawl-by-small-patches>,
> but if the domain is limited, i.e. we could finish crawling all of it and
> just want to do so incrementally, then depth is no longer relevant.
>
> Essentially I envision a while(true) loop that breaks when generate
> returns no new urls (Q: how can I know this in the script?). But generate
> doesn't seem to report this:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 200
> Generator: starting at 2011-03-22 17:58:24
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 200
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: crawl/segments/20110322175835
> Generator: finished at 2011-03-22 17:58:40, elapsed: 00:00:15
>
>
>
>
> On Tue, Mar 22, 2011 at 5:27 PM, Markus Jelsma <[email protected]
> > wrote:
>
>> Use -topN N. You can also limitByHost via configuration.
>>
>> On Tuesday 22 March 2011 17:20:33 Gabriele Kahlout wrote:
>> > On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma
>> >
>> > <[email protected]>wrote:
>> > > On Tuesday 22 March 2011 14:14:06 Gabriele Kahlout wrote:
>> > > > > Yes, you need to wait. You must finish the fetch, then parse the
>> > > > > fetch and update the crawldb (and optionally the linkdb). Finally
>> > > > > you must index and only then are your documents searchable.
>> > > >
>> > > > I can see injecting fewer urls at a time, i.e. I complete an
>> > > > inject-fetch-index cycle and then re-start it with new urls.
>> > >
>> > > You don't need to inject every cycle. Inject once then repeat the
>> > > following
>> >
>> > Yes, but how do I limit the # of urls fetched in each cycle?
>> > Are we talking about -maxNumSegments?
>> > $ bin/nutch generate
>> > Usage: Generator <crawldb> <segments_dir> [-force] [-topN N]
>> [-numFetchers
>> > numFetchers] [-adddays numDays] [-noFilter] [-noNorm][*-maxNumSegments
>> > num*]
>> >
>> > > cycle:
>> > > - fetch
>> >
>> > - parse
>> >
>> > > - update linkdb and crawldb
>> > > - index
>> > >
>> > > > Q1: After the 1st iteration can I start searching, while the 2nd
>> > >
>> > > iteration
>> > >
>> > > > is in progress?
>> > >
>> > > Yes. Once you indexed the data you can start the 2nd iteration and
>> > > search.
>> > >
>> > > > Q2: during the fetch of the 2nd iteration, what prevents fetch from
>> > > > fetching again what was fetched in the 1st iteration (assuming it's
>> > > > still before db.fetch.interval.default)?
>> > >
>> > > Well, if fetch_time + interval > NOW then it won't get fetched.
>> > >
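That refetch test can be sketched in shell arithmetic. The variable names and the hard-coded timestamp are illustrative only, not Nutch internals; only the 30-day interval corresponds to the stock db.fetch.interval.default:

```shell
fetch_time=1300000000            # epoch seconds of the last fetch (illustrative)
interval=2592000                 # 30 days in seconds (db.fetch.interval.default)
now=$(date +%s)

# A page is selected again only once its fetch interval has elapsed.
if [ $((fetch_time + interval)) -le "$now" ]; then
    echo "due for fetch"
else
    echo "skip: fetched within interval"
fi
```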
>> > > > I'm not sure if fetching fewer segments and index them, and then
>> fetch
>> > >
>> > > more
>> > >
>> > > > (i.e. iterate only fetch-index) is a better option, such that after
>> the
>> > >
>> > > 1st
>> > >
>> > > > iteration I can start searching.
>> > > >
>> > > >
>> > > > Thank you.
>> > > >
>> > > > > > >but remember that results don't become available for searching
>> > > > > > >immediately after
>> > > > > >
>> > > > > > *fetching*. *All* pages must be fetched and then *indexed* first
>> > > > > > to be searchable.
>> > > > >
>> > > > > --
>> > > > > Markus Jelsma - CTO - Openindex
>> > > > > http://www.linkedin.com/in/markus17
>> > > > > 050-8536620 / 06-50258350
>> > >
>>
>
>
>
> --
> Regards,
> K. Gabriele
>
--
Regards,
K. Gabriele
--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).
If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).