Hi Nima,

I have never used the Nutch web admin. The web admin you are using is very
old. You may want to try our brand new web admin, which is still under
development (
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841). It
has just been committed to trunk and the 2.x branch.

As for your question, IMHO start URLs means your seed list. Nutch accepts
it as a folder or a text file.

Limit URLs determine which newly discovered URLs will be accepted for the
next steps once the crawler has started from the seed list. You can actually
write regex rules in Nutch. For example, say you crawl the home page of a
news site but only want the sports URLs: you can write a regex rule that
accepts only sports URLs.

For the reverse situation, you can use Exclude URLs.
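
To make the limit/exclude idea concrete, here is a minimal sketch in Python.
It is only an illustration of regex-based accept and exclude rules, not the
web admin's or Nutch's actual code; the host news.example.com, the rule list
and the function name are all made up for the example. In Nutch itself, rules
of this kind are normally written in conf/regex-urlfilter.txt as regex lines
prefixed with + (accept) or - (reject).

```python
import re

# Hypothetical rules for illustration: (accept?, pattern).
# The first rule whose pattern matches decides; unmatched URLs are rejected.
RULES = [
    # "Exclude URL": drop the sports archive even though it is under /sports/.
    (False, re.compile(r"^https?://news\.example\.com/sports/archive/")),
    # "Limit URL": accept everything else under /sports/.
    (True, re.compile(r"^https?://news\.example\.com/sports/")),
]


def is_accepted(url):
    """Return True if the URL may be kept for the next crawl step."""
    for accept, pattern in RULES:
        if pattern.search(url):
            return accept
    return False


# Links that a crawl of the (made-up) home page might discover.
discovered = [
    "http://news.example.com/sports/football.html",
    "http://news.example.com/sports/archive/2013.html",
    "http://news.example.com/politics/election.html",
]

next_round = [url for url in discovered if is_accepted(url)]
print(next_round)
```

Only the football page survives here: the politics page matches no accept
rule, and the archive page is caught by the exclude rule first.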

Talat.
Hello Everyone:

I am following the directions in this tutorial word for word:

https://github.com/101tec/nutch/wiki/admin-url-upload

My question is: what is the difference between the start and limit URLs?
From the wiki I saw that the limit URL seems to be a flat list of URLs we
want to fetch, but then why have a start URL to begin with?

Also, I noticed that when you do not have a limit URL, you get the following
exception (note: I am using nutch-gui-0.5-dev).

When you start a crawl, shouldn't there be some sort of pop-up box saying
that you need a limit URL, so that you don't get this exception? I can help
work on this.


14/09/25 20:34:37  INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done
14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:217) - bw update: starting
14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:218) - bw update: db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb
14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:219) - bw update: bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb
14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:220) - bw update: segments: [/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]
14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:223) - bw update: wrapping started.
14/09/25 20:34:37  WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38  INFO [Thread-3071] (BWUpdateDb.java:248) - bw update: filtering started.
14/09/25 20:34:38  WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)
    at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)
    at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)
    at java.lang.Thread.run(Thread.java:744)
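
The kind of up-front check suggested above (warn the user before the crawl
starts instead of failing later in BWUpdateDb) could be sketched roughly as
follows. This is a hypothetical illustration in Python, not code from
nutch-gui: the function names and the message text are invented, and it
assumes that a missing bwdb/current directory, the path named in the
exception, is what an absent limit URL looks like on disk.

```python
import tempfile
from pathlib import Path


def has_limit_urls(crawl_dir):
    """Assumption: limit URLs end up under <crawl_dir>/bwdb/current."""
    return (Path(crawl_dir) / "bwdb" / "current").exists()


def check_before_crawl(crawl_dir):
    """Return a warning for the user, or None if the crawl may start."""
    if not has_limit_urls(crawl_dir):
        return "Please add at least one limit URL before starting the crawl."
    return None


# Small demonstration using a throwaway directory as the crawl folder.
with tempfile.TemporaryDirectory() as crawl_dir:
    print(check_before_crawl(crawl_dir))  # warning: bwdb/current is missing
    (Path(crawl_dir) / "bwdb" / "current").mkdir(parents=True)
    print(check_before_crawl(crawl_dir))  # None: the check passes
```

A GUI could show the returned message in a dialog and skip starting the crawl
thread, so the InvalidInputException is never reached.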


--



Nima Falaki
Software Engineer
[email protected]
