On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma
<[email protected]> wrote:

> You don't need to inject every cycle. Inject once then repeat the following
> cycle:
> - fetch
> - parse
> - update linkdb and crawldb
> - index
>

Except that when indexing runs inside the cycle, it reports that the previous
index already exists. Should I first execute rm -r crawl/indexes?
The re-crawl script in the wiki does the indexing last (i.e. not in the
cycle). Until then the fetched pages are not searchable, but the index is not
rebuilt from scratch at each cycle. Is this the trade-off? Is there no
'incremental' indexing as well?


bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
crawl/segments/20110323073813 crawl/segments/20110323074910
crawl/segments/20110323080127 crawl/segments/20110323081325
crawl/segments/20110323083523 crawl/segments/20110323085632
crawl/segments/20110323091736 crawl/segments/20110323093939
crawl/segments/20110323100053 crawl/segments/20110323102159
crawl/segments/20110323104245 crawl/segments/20110323110421
crawl/segments/20110323112631 crawl/segments/20110323114631
crawl/segments/20110323114800 crawl/segments/20110323114936
crawl/segments/20110323121309 crawl/segments/20110323122425
crawl/segments/20110323123805 crawl/segments/20110323125107
crawl/segments/20110323131222 crawl/segments/20110323133252
crawl/segments/20110323135345 crawl/segments/20110323141600
Indexer: starting at 2011-03-23 14:36:12
Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: *Output
directory file:/home/gkahlout/nutch-1.2/crawl/indexes already exists*
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
    at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)
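
The trace above shows Hadoop's FileOutputFormat refusing to write into an
output directory that already exists. One possible workaround, given here only
as a rough sketch, is to build the index in a fresh directory and swap it into
place once the indexer succeeds, so the old index stays available until then.
The NUTCH and CRAWL variables, the indexes.new directory name and the reindex
function are assumptions made for illustration; they are not part of Nutch.

```shell
#!/bin/sh
# Sketch only. NUTCH and CRAWL are assumed defaults; override as needed.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL="${CRAWL:-crawl}"

# reindex: rebuild the index from all segments, then replace the old one.
reindex() {
    new="$CRAWL/indexes.new"             # hypothetical scratch directory
    rm -rf "$new"                        # clear leftovers from a failed run
    # The indexer is always given an output path that does not exist yet.
    "$NUTCH" index "$new" "$CRAWL/crawldb" "$CRAWL/linkdb" \
        "$CRAWL"/segments/* || return 1
    rm -rf "$CRAWL/indexes"              # drop the old index only after success
    mv "$new" "$CRAWL/indexes"
}
```

Calling reindex at the end of each cycle would still rebuild the whole index
from every segment; it only avoids the "already exists" failure and does not
make the indexing incremental.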

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
