My crawl keeps stopping without fetching anything. Is the first warning (the "file" attribute not being defined in plugin.xml for urlfilter-domain) the cause? The generator selects 0 records, and every step after that fails because /lib/nutch/crawl/segments/* matches no files. Can you see something I missed? Here is the log:
2010-11-07 18:33:16,003 WARN domain.DomainURLFilter - Attribute "file" is not defined in plugin.xml for plugin urlfilter-domain
2010-11-07 18:33:16,016 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2010-11-07 18:33:16,016 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2010-11-07 18:33:16,016 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2010-11-07 18:33:16,967 WARN crawl.Generator - Generator: 0 records selected for fetching, exiting ...
2010-11-07 18:33:18,682 INFO segment.SegmentMerger - Merging 1 segments to /lib/nutch/crawl/MERGEDsegments/20101107183318
2010-11-07 18:33:18,684 WARN segment.SegmentMerger - Input dir /lib/nutch/crawl/segments/* doesn't exist, skipping.
2010-11-07 18:33:18,684 INFO segment.SegmentMerger - SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text
2010-11-07 18:33:18,717 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2010-11-07 18:33:19,792 INFO crawl.LinkDb - LinkDb: starting at 2010-11-07 18:33:19
2010-11-07 18:33:19,793 INFO crawl.LinkDb - LinkDb: linkdb: /lib/nutch/crawl/linkdb
2010-11-07 18:33:19,793 INFO crawl.LinkDb - LinkDb: URL normalize: true
2010-11-07 18:33:19,793 INFO crawl.LinkDb - LinkDb: URL filter: true
2010-11-07 18:33:19,802 INFO crawl.LinkDb - LinkDb: adding segment: /lib/nutch/crawl/segments/*
2010-11-07 18:33:20,167 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/lib/nutch/crawl/segments/*/parse_data matches 0 files
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2010-11-07 18:33:20,826 INFO solr.SolrIndexer - SolrIndexer: starting at 2010-11-07 18:33:20
2010-11-07 18:33:20,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /lib/nutch/crawl/crawldb
2010-11-07 18:33:20,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /lib/nutch/crawl/linkdb
2010-11-07 18:33:20,898 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: /lib/nutch/crawl/segments/*
2010-11-07 18:33:21,684 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/lib/nutch/crawl/segments/*/crawl_fetch matches 0 files
Input Pattern file:/lib/nutch/crawl/segments/*/crawl_parse matches 0 files
Input Pattern file:/lib/nutch/crawl/segments/*/parse_data matches 0 files
Input Pattern file:/lib/nutch/crawl/segments/*/parse_text matches 0 files
Input path does not exist: file:/lib/nutch/crawl/linkdb/current
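
For reference, a quick way to confirm what the jobs actually see is to expand the same glob myself (a minimal sketch against the Hadoop FileSystem API; GlobCheck is just an illustrative name, and the pattern is copied from the log above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One-off check: expand the same glob the LinkDb and indexer jobs use
// and print whatever it matches.
public class GlobCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Pattern copied verbatim from the log above.
    Path pattern = new Path("file:/lib/nutch/crawl/segments/*");
    FileSystem fs = pattern.getFileSystem(conf);
    FileStatus[] matches = fs.globStatus(pattern);
    if (matches == null || matches.length == 0) {
      System.out.println("glob matched 0 segments: " + pattern);
    } else {
      for (FileStatus status : matches) {
        System.out.println(status.getPath());
      }
    }
  }
}

If that glob expands to nothing, the InvalidInputException above follows directly, since FileInputFormat.listStatus rejects an input pattern that matches 0 files.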