Hello List,

The only material I could find on this was a post of my own from some time ago, which addressed a slightly different problem case.

During the indexing stage of a recrawl, my Hadoop log reads as follows:

Indexer: starting at 2011-01-10 16:40:42
Indexer: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/C:/Downloads/Apache/nutch-1.2/crawl/indexes already exists
        at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:111)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.indexer.Indexer.index(Indexer.java:76)
        at org.apache.nutch.indexer.Indexer.run(Indexer.java:97)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.Indexer.main(Indexer.java:106)

My quick question is: is it necessary to delete/remove the existing indexes before I can index freshly fetched web data?
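If deleting is indeed what's required, I assume a workaround along these lines would do before rerunning the indexer (just a sketch; the path is copied from the exception in the log above, so adjust it to your own crawl directory):

```shell
# Path taken from the FileAlreadyExistsException above -- adjust as needed.
INDEX_DIR="/C:/Downloads/Apache/nutch-1.2/crawl/indexes"

# Remove the stale index output so Hadoop's FileOutputFormat existence
# check no longer fails. rm -rf succeeds even if the directory is absent.
rm -rf "$INDEX_DIR"
```

(On Windows I run this from Cygwin; a plain `rmdir /s /q` from cmd.exe should be the equivalent.)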

Thank you

Lewis


Glasgow Caledonian University is a registered Scottish charity, number SC021474

Winner: Times Higher Education's Widening Participation Initiative of the Year 
2009 and Herald Society's Education Initiative of the Year 2009
http://www.gcu.ac.uk/newsevents/news/bycategory/theuniversity/1/name,6219,en.html
