This seems to work.

# $3 is the maximum number of iterations to run
i=0
while true;
do
    if [[ $i -ge $3 ]]
    then
        break;
    fi
    echo
    echo "generate-fetch-updatedb-invertlinks-index-merge iteration "$i":"

    cmd="bin/nutch generate crawl/crawldb crawl/segments -topN 1"
    echo $cmd
    output=`$cmd`
    echo $output
    if [[ $output == *'0 records selected for fetching'* ]]
    then
        break;
    fi
    s1=`ls -d crawl/segments/2* | tail -1`
    echo $s1

    echo
    cmd="bin/nutch fetch $s1"
    echo $cmd
    $cmd

    echo
    cmd="bin/nutch updatedb crawl/crawldb $s1"
    echo $cmd
    $cmd

    echo
    cmd="bin/nutch invertlinks crawl/linkdb -dir crawl/segments"
    echo $cmd
    $cmd

    indexes="crawl/indexes"
    temp_indexes="crawl/temp_indexes"
    # the indexer refuses an existing output directory, so clear the
    # index from the previous iteration before re-indexing all segments
    rm -rf $indexes
    cmd="bin/nutch index $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
    echo
    echo $cmd
    $cmd

    cmd="bin/nutch merge $temp_indexes $indexes"
    echo
    echo $cmd
    $cmd
    # mv alone would nest temp_indexes inside the existing directory
    rm -r $indexes && mv $temp_indexes $indexes

    ((i++))
done
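For what it's worth, the stop condition on the generate output can be exercised in isolation. This is a minimal sketch where the nutch call is replaced by a stubbed message; the exact Generator wording is an assumption on my part:

```shell
#!/usr/bin/env bash
# Stub for the generate step's output; the real script captures it
# with output=`$cmd`. The Generator message text is assumed here.
output="Generator: 0 records selected for fetching, exiting ..."

# Same substring match as in the loop above: stop once nothing is
# left to fetch, otherwise continue with the next iteration.
if [[ $output == *'0 records selected for fetching'* ]]; then
    echo "stop"
else
    echo "continue"
fi
```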


On Wed, Mar 23, 2011 at 5:41 PM, Gabriele Kahlout
<[email protected]>wrote:

> On Wed, Mar 23, 2011 at 2:52 PM, Gabriele Kahlout <
> [email protected]> wrote:
>
>>
>>
>> On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <
>> [email protected]> wrote:
>>
>>> You don't need to inject every cycle. Inject once then repeat the
>>> following
>>> cycle:
>>> - fetch
>>> - parse
>>> - update linkdb and crawldb
>>> - index
>>>
>>
>> except that when indexing inside the cycle it complains that the
>> previous index already exists. Should I first execute rm -r crawl/indexes?
>> The re-crawl script in the wiki does indexing last (i.e. not in the
>> cycle). Until then the fetched pages will not be searchable, but we also
>> won't rebuild the index from scratch at each cycle. Is that the
>> trade-off? Is there no 'incremental' index?
>>
> I first thought bin/nutch merge would do the trick and came up with a
> solution along these lines:
> indexes="crawl/indexes"
>     index_or_merge="merge"
>     if [ ! -d $indexes ]; then
>         index_or_merge="index"
>     fi
>     cmd="bin/nutch $index_or_merge $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
>     $cmd
>
> But that doesn't index the new data (among other problems) and so I thought
> of 2 passes:
> indexes="crawl/indexes/$i"
>     cmd="bin/nutch index $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
>     $cmd
>
>     cmd="bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb crawl/segments/*"
>     $cmd
>     ((i++))
>
> bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb
> crawl/segments/20110323173111
> IndexMerger: starting at 2011-03-23 17:32:29
> IndexMerger: merging indexes to: crawl/index
> Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/0
> Adding file:/Users/simpatico/nutch-1.2/crawl/crawldb/current
> Adding file:/Users/simpatico/nutch-1.2/crawl/linkdb/current
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/content
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_fetch
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_generate
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_parse
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_data
> Adding
> file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_text
> IndexMerger: java.io.FileNotFoundException: no segments file found in
> org.apache.nutch.indexer.FsDirectory@file:/Users/simpatico/nutch-1.2/crawl/indexes/0:
> files: [part-00000]
>     at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:628)
>     at
> org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:521)
>     at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:308)
>     at
> org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3028)
>     at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:109)
>     at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:163)
>
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>     at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:125)
>
> What shall I do?
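In case it helps a later reader: IndexMerger treats every argument after the output directory as a Lucene index, so passing crawldb, linkdb and segments is what triggers the error; those are Hadoop MapFile data without Lucene's segments file, which is exactly what the FileNotFoundException complains about. A sketch of the invocation with index directories only (paths taken from the run above, and only printed here since running it needs a live Nutch checkout):

```shell
#!/usr/bin/env bash
# IndexMerger usage: bin/nutch merge <outputIndex> <indexesDir>...
# Every <indexesDir> must contain Lucene indexes; crawldb, linkdb and
# the segments are not Lucene indexes and must not be passed to merge.
cmd="bin/nutch merge crawl/index crawl/indexes"
echo $cmd   # printed, not executed
```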
>
>
>> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
>> crawl/segments/20110323073813 crawl/segments/20110323074910
>> crawl/segments/20110323080127 crawl/segments/20110323081325
>> crawl/segments/20110323083523 crawl/segments/20110323085632
>> crawl/segments/20110323091736 crawl/segments/20110323093939
>> crawl/segments/20110323100053 crawl/segments/20110323102159
>> crawl/segments/20110323104245 crawl/segments/20110323110421
>> crawl/segments/20110323112631 crawl/segments/20110323114631
>> crawl/segments/20110323114800 crawl/segments/20110323114936
>> crawl/segments/20110323121309 crawl/segments/20110323122425
>> crawl/segments/20110323123805 crawl/segments/20110323125107
>> crawl/segments/20110323131222 crawl/segments/20110323133252
>> crawl/segments/20110323135345 crawl/segments/20110323141600
>> Indexer: starting at 2011-03-23 14:36:12
>> Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: Output
>> directory file:/home/gkahlout/nutch-1.2/crawl/indexes already exists
>>     at
>> org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
>>     at
>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>     at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
>>     at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>     at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)
>>
>>
>> --
>> Regards,
>> K. Gabriele
>>
>> --- unchanged since 20/9/10 ---
>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>> receipt within 48 hours then I don't resend the email.
>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>
>> If an email is sent by a sender that is not a trusted contact or the email
>> does not contain a valid code then the email is not received. A valid code
>> starts with a hyphen and ends with "X".
>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>> L(-[a-z]+[0-9]X)).
>>
>>
>
>
> --
> Regards,
> K. Gabriele
>
>
>


-- 
Regards,
K. Gabriele

