Hello Everyone:

I am following the directions in this tutorial word for word:

https://github.com/101tec/nutch/wiki/admin-url-upload

My question is: what is the difference between the start and limit URLs?
From the wiki, the limit URLs seem to be a flat list of URLs we want to
fetch, but then why have a start URL to begin with?

Also, I noticed that when you do not have a limit URL, you get the exception
shown below (note: I am using nutch-gui-0.5-dev).

When you start a crawl, shouldn't a pop-up box appear saying that you need a
limit URL, so that you don't get this exception? I can help work on this.


14/09/25 20:34:37  INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:217) - bw update:
starting

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:218) - bw update:
db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:219) - bw update:
bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:220) - bw update:
segments:
[/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]

14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:223) - bw update:
wrapping started.

14/09/25 20:34:37  WARN [Thread-3071] (JobClient.java:547) - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.

14/09/25 20:34:38  INFO [Thread-3071] (BWUpdateDb.java:248) - bw update:
filtering started.

14/09/25 20:34:38  WARN [Thread-3071] (JobClient.java:547) - Use
GenericOptionsParser for parsing the arguments. Applications should
implement Tool for the same.

14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can
not start crawl.

14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can
not start crawl.

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current

at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)

at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)

at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)

at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)

at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)

at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)

at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)

at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)

at java.lang.Thread.run(Thread.java:744)
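
The trace shows the bw update job failing because the bwdb/current directory is missing under the crawl folder. As a rough illustration of the kind of check proposed above, here is a minimal sketch of a pre-flight test that could run before the crawl is started. It is not Nutch or nutch-gui code: the class name LimitUrlPreflight and the method hasLimitUrlDb are hypothetical, and it assumes that bwdb/current only exists once limit URLs have been uploaded, which is inferred from the exception message and not confirmed.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of Nutch or nutch-gui.
class LimitUrlPreflight {

    // Returns true if <crawlDir>/bwdb/current exists as a directory.
    // The bwdb/current layout is taken from the exception message in the log.
    // Assumption: the directory is only created when limit URLs were uploaded.
    static boolean hasLimitUrlDb(Path crawlDir) {
        return Files.isDirectory(crawlDir.resolve("bwdb").resolve("current"));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a crawl folder such as .../crawls/Crawl-<timestamp>
        Path crawlDir = Files.createTempDirectory("Crawl-demo");

        // No bwdb yet: a GUI could show a warning here and not start the crawl.
        System.out.println("limit URLs present: " + hasLimitUrlDb(crawlDir));

        // Simulate an uploaded limit URL list by creating the directory.
        Files.createDirectories(crawlDir.resolve("bwdb").resolve("current"));
        System.out.println("limit URLs present: " + hasLimitUrlDb(crawlDir));
    }
}
```

A GUI could call a check like this before launching the crawl thread and show a message if it returns false, so the user sees the problem before the Hadoop job is submitted.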


-- 



Nima Falaki
Software Engineer
[email protected]
