On Wed, Mar 23, 2011 at 2:52 PM, Gabriele Kahlout <[email protected]> wrote:
> On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
>> You don't need to inject every cycle. Inject once, then repeat the
>> following cycle:
>> - fetch
>> - parse
>> - update linkdb and crawldb
>> - index
>
> Except that when the index step runs inside the cycle, it complains that
> the previous index already exists. Should I first execute rm -r
> crawl/indexes?
> The re-crawl script in the wiki does the indexing last (i.e. not in the
> cycle). Until then the fetched pages will not be searchable, but we will
> not restart the index from scratch at each cycle; is this the trade-off?
> Is there no 'incremental' index?

I first thought bin/nutch merge would do the trick and came up with a
solution along these lines:

  indexes="crawl/indexes"
  index_or_merge="merge"
  if [ ! -d $indexes ]; then index_or_merge="index"; fi
  cmd="bin/nutch $index_or_merge $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd

But that doesn't index the new data (among other problems), so I thought
of two passes:

  indexes="crawl/indexes/$i"
  cmd="bin/nutch index $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd
  cmd="bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd
  i=$((i + 1))

bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb crawl/segments/20110323173111
IndexMerger: starting at 2011-03-23 17:32:29
IndexMerger: merging indexes to: crawl/index
Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/0
Adding file:/Users/simpatico/nutch-1.2/crawl/crawldb/current
Adding file:/Users/simpatico/nutch-1.2/crawl/linkdb/current
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/content
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_fetch
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_generate
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_parse
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_data
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_text
IndexMerger: java.io.FileNotFoundException: no segments* file found in org.apache.nutch.indexer.FsDirectory@file:/Users/simpatico/nutch-1.2/crawl/indexes/0: files: [part-00000]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:628)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:521)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:308)
        at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3028)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:109)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:163)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:125)

What shall I do?

> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
> crawl/segments/20110323073813 crawl/segments/20110323074910
> crawl/segments/20110323080127 crawl/segments/20110323081325
> crawl/segments/20110323083523 crawl/segments/20110323085632
> crawl/segments/20110323091736 crawl/segments/20110323093939
> crawl/segments/20110323100053 crawl/segments/20110323102159
> crawl/segments/20110323104245 crawl/segments/20110323110421
> crawl/segments/20110323112631 crawl/segments/20110323114631
> crawl/segments/20110323114800 crawl/segments/20110323114936
> crawl/segments/20110323121309 crawl/segments/20110323122425
> crawl/segments/20110323123805 crawl/segments/20110323125107
> crawl/segments/20110323131222 crawl/segments/20110323133252
> crawl/segments/20110323135345 crawl/segments/20110323141600
> Indexer: starting at 2011-03-23 14:36:12
> Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: Output
> directory file:/home/gkahlout/nutch-1.2/crawl/indexes already exists
>         at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
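[Editorial sketch] The two-pass idea above might work if bin/nutch merge is given only index directories: the "Adding file:.../crawldb/current" lines show IndexMerger trying to open crawldb, linkdb, and segment directories as Lucene indexes, which is what triggers the "no segments* file found" failure. The per-cycle index directory and the dry-run `run` wrapper below are illustrative assumptions, not Nutch's own script:

```shell
# Illustrative only: index each new segment into its own directory, then
# merge the *index* directories. 'run' just echoes each command (dry run);
# remove the echo to execute for real.
run() { echo "$@"; }

seg=crawl/segments/20110323173111   # example segment name from the mail
i=0

# Pass 1: index only the new segment into a per-cycle directory.
run bin/nutch index "crawl/indexes/$i" crawl/crawldb crawl/linkdb "$seg"

# Pass 2: merge all per-cycle indexes into one index.
# Note: no crawldb/linkdb/segments arguments here -- index dirs only.
run bin/nutch merge crawl/index crawl/indexes/*

i=$((i + 1))
```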


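[Editorial sketch] The inject-once-then-cycle pattern Markus describes, with indexing kept out of the loop, might be scripted roughly as below. The depth, the urls seed directory, the SEGMENT placeholder, and the dry-run `run` wrapper are assumptions for illustration, not the wiki's recrawl script:

```shell
# Rough sketch of the recrawl cycle from the thread (Nutch 1.2 layout).
# 'run' only echoes each command (dry run); drop the echo to execute.
run() { echo "$@"; }

crawl=crawl
depth=3   # number of generate/fetch/parse cycles (illustrative)

run bin/nutch inject "$crawl/crawldb" urls   # once, not every cycle

n=0
while [ "$n" -lt "$depth" ]; do
  run bin/nutch generate "$crawl/crawldb" "$crawl/segments"
  # In a real run pick the newest segment, e.g.:
  #   seg=`ls -d $crawl/segments/* | tail -1`
  seg="$crawl/segments/SEGMENT"
  run bin/nutch fetch "$seg"
  run bin/nutch parse "$seg"
  run bin/nutch updatedb "$crawl/crawldb" "$seg"
  n=$((n + 1))
done

run bin/nutch invertlinks "$crawl/linkdb" -dir "$crawl/segments"
# The Indexer refuses to write into an existing output directory
# (the FileAlreadyExistsException above), so clear it before re-indexing.
run rm -r "$crawl/indexes"
run bin/nutch index "$crawl/indexes" "$crawl/crawldb" "$crawl/linkdb" \
    "$crawl"/segments/*
```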