OK, so now I think we're at the bottom of it. If you wish to create a linkdb in Nutch >= 1.4 you need to pass the linkdb parameter explicitly. This was implemented because not everyone wishes to create a linkdb.

Your invertlinks command should be invoked as follows:

bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir /path/to/segment/dirs

and then:

bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb path/to/linkdb -dir path/to/segment/dirs

If you do not pass -linkdb path/to/linkdb explicitly, an exception will be thrown, because the linkdb argument is now treated as a segment directory.
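For example, using the paths from your log below, and assuming a default local Solr instance at http://localhost:8983/solr (substitute your own Solr URL), the two commands would be:

bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb -dir /opt/nutch_1_4/data/crawl/segments
bin/nutch solrindex http://localhost:8983/solr /opt/nutch_1_4/data/crawl/crawldb -linkdb /opt/nutch_1_4/data/crawl/linkdb -dir /opt/nutch_1_4/data/crawl/segments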
On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[email protected]> wrote:
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06
> 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb:
> /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb:
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at
> 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer -
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> Input path does not exist:
> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db:
> /opt/nutch_1_4/data/crawl/crawldb/
> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
> for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done

--
Lewis

