On Wed, Mar 23, 2011 at 2:52 PM, Gabriele Kahlout <[email protected]> wrote:
> On Tue, Mar 22, 2011 at 2:28 PM, Markus Jelsma <[email protected]> wrote:
>> You don't need to inject every cycle. Inject once, then repeat the
>> following cycle:
>> - fetch
>> - parse
>> - update linkdb and crawldb
>> - index
>
> Except that when the index step runs inside the cycle, it complains that
> the previous index already exists. Should I first execute rm -r
> crawl/indexes?
> The re-crawl script in the wiki does the indexing last (i.e. not in the
> cycle). Until then the fetched pages will not be searchable, but we will
> not restart the index from scratch at each cycle; is this the trade-off?
> Is there no 'incremental' index?

I first thought bin/nutch merge would do the trick and came up with a
solution along these lines:

  indexes="crawl/indexes"
  index_or_merge="merge"
  if [ ! -d $indexes ]; then index_or_merge="index"; fi
  cmd="bin/nutch $index_or_merge $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd

But that doesn't index the new data (among other problems), so I thought
of two passes:

  indexes="crawl/indexes/$i"
  cmd="bin/nutch index $indexes crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd
  cmd="bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb crawl/segments/*"
  $cmd
  i=$((i + 1))

bin/nutch merge crawl/index crawl/indexes/ crawl/crawldb crawl/linkdb crawl/segments/20110323173111
IndexMerger: starting at 2011-03-23 17:32:29
IndexMerger: merging indexes to: crawl/index
Adding file:/Users/simpatico/nutch-1.2/crawl/indexes/0
Adding file:/Users/simpatico/nutch-1.2/crawl/crawldb/current
Adding file:/Users/simpatico/nutch-1.2/crawl/linkdb/current
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/content
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_fetch
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_generate
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/crawl_parse
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_data
Adding file:/Users/simpatico/nutch-1.2/crawl/segments/20110323173111/parse_text
IndexMerger: java.io.FileNotFoundException: no segments* file found in org.apache.nutch.indexer.FsDirectory@file:/Users/simpatico/nutch-1.2/crawl/indexes/0: files: [part-00000]
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:628)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:521)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:308)
        at org.apache.lucene.index.IndexWriter.addIndexesNoOptimize(IndexWriter.java:3028)
        at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:109)
        at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:163)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:125)

What shall I do?

> bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb
> crawl/segments/20110323073813 crawl/segments/20110323074910
> crawl/segments/20110323080127 crawl/segments/20110323081325
> crawl/segments/20110323083523 crawl/segments/20110323085632
> crawl/segments/20110323091736 crawl/segments/20110323093939
> crawl/segments/20110323100053 crawl/segments/20110323102159
> crawl/segments/20110323104245 crawl/segments/20110323110421
> crawl/segments/20110323112631 crawl/segments/20110323114631
> crawl/segments/20110323114800 crawl/segments/20110323114936
> crawl/segments/20110323121309 crawl/segments/20110323122425
> crawl/segments/20110323123805 crawl/segments/20110323125107
> crawl/segments/20110323131222 crawl/segments/20110323133252
> crawl/segments/20110323135345 crawl/segments/20110323141600
> Indexer: starting at 2011-03-23 14:36:12
> Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: Output
> directory file:/home/gkahlout/nutch-1.2/crawl/indexes already exists
>         at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
>         at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>         at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
>         at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).
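[Editorial sketch] The two-pass idea above might work if bin/nutch merge is given only index directories: the "Adding file:.../crawldb/current" lines show IndexMerger trying to open crawldb, linkdb, and segment directories as Lucene indexes, which is what triggers the "no segments* file found" failure. The per-cycle index directory and the dry-run `run` wrapper below are illustrative assumptions, not Nutch's own script:

```shell
# Illustrative only: index each new segment into its own directory, then
# merge the *index* directories. 'run' just echoes each command (dry run);
# remove the echo to execute for real.
run() { echo "$@"; }

seg=crawl/segments/20110323173111   # example segment name from the mail
i=0

# Pass 1: index only the new segment into a per-cycle directory.
run bin/nutch index "crawl/indexes/$i" crawl/crawldb crawl/linkdb "$seg"

# Pass 2: merge all per-cycle indexes into one index.
# Note: no crawldb/linkdb/segments arguments here -- index dirs only.
run bin/nutch merge crawl/index crawl/indexes/*

i=$((i + 1))
```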


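[Editorial sketch] The inject-once-then-cycle pattern Markus describes, with indexing kept out of the loop, might be scripted roughly as below. The depth, the urls seed directory, the SEGMENT placeholder, and the dry-run `run` wrapper are assumptions for illustration, not the wiki's recrawl script:

```shell
# Rough sketch of the recrawl cycle from the thread (Nutch 1.2 layout).
# 'run' only echoes each command (dry run); drop the echo to execute.
run() { echo "$@"; }

crawl=crawl
depth=3   # number of generate/fetch/parse cycles (illustrative)

run bin/nutch inject "$crawl/crawldb" urls   # once, not every cycle

n=0
while [ "$n" -lt "$depth" ]; do
  run bin/nutch generate "$crawl/crawldb" "$crawl/segments"
  # In a real run pick the newest segment, e.g.:
  #   seg=`ls -d $crawl/segments/* | tail -1`
  seg="$crawl/segments/SEGMENT"
  run bin/nutch fetch "$seg"
  run bin/nutch parse "$seg"
  run bin/nutch updatedb "$crawl/crawldb" "$seg"
  n=$((n + 1))
done

run bin/nutch invertlinks "$crawl/linkdb" -dir "$crawl/segments"
# The Indexer refuses to write into an existing output directory
# (the FileAlreadyExistsException above), so clear it before re-indexing.
run rm -r "$crawl/indexes"
run bin/nutch index "$crawl/indexes" "$crawl/crawldb" "$crawl/linkdb" \
    "$crawl"/segments/*
```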