Hi Nima,

I have never used the Nutch web admin. The web admin that you used is very old. You might try our brand-new web admin, which is still under development (https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841). It has just been committed to the trunk and 2.x branches.
As for your question: IMHO, start URLs means your seed list. Nutch accepts it as a folder or a text file. Limit URLs determine which newly discovered URLs will be accepted for the next steps once the crawler starts working from the seed list. Actually, you can write regex rules in Nutch. For example, say you crawl the home page of a news site but only want the sports URLs. You can write a regex rule that accepts only sports URLs. For the reverse situation, you can use Exclude URLs.

Talat

Hello Everyone:

I am following the directions word for word in this tutorial: https://github.com/101tec/nutch/wiki/admin-url-upload

My question is: what is the difference between the start and limit URLs? From the wiki, I saw that the limit URLs seem to be a flat list of URLs we want to fetch, but then why have start URLs to begin with?

Also, I noticed that when you do not have a limit URL, you get the following exception (note: I am using nutch-gui-0.5-dev). When you start a crawl, shouldn't there be some sort of pop-up box saying that you need a limit URL, so you don't get this exception? I can help work on this.

14/09/25 20:34:37 INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:217) - bw update: starting
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:218) - bw update: db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:219) - bw update: bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:220) - bw update: segments: [/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]
14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:223) - bw update: wrapping started.
14/09/25 20:34:37 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38 INFO [Thread-3071] (BWUpdateDb.java:248) - bw update: filtering started.
14/09/25 20:34:38 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/09/25 20:34:38 WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
    at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)
    at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)
    at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)
    at java.lang.Thread.run(Thread.java:744)

--
Nima Falaki
Software Engineer
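
The regex rule idea from the reply (accept only the sports URLs of a news site, reject the rest) can be sketched roughly as follows. This is only an illustration in Python, not Nutch code: it assumes rules written in the style of Nutch's regex-urlfilter.txt, where a leading + accepts, a leading - rejects, and the first matching rule decides. The host news.example.com, the rule set, and the helper names parse_rules and accepts are all made up for the example.

```python
import re

# Hypothetical rules in the style of Nutch's regex-urlfilter.txt.
# '+' means accept, '-' means reject; the first matching rule wins.
RULES = r"""
# skip images and stylesheets
-\.(gif|jpg|png|css)$
# accept only the sports section of a made-up news site
+^https?://news\.example\.com/sports/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first rule matching the URL is an accept rule."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


rules = parse_rules(RULES)
for url in (
    "http://news.example.com/sports/football.html",
    "http://news.example.com/politics/election.html",
    "http://news.example.com/sports/logo.png",
):
    print(url, accepts(url, rules))
```

In this sketch the start URLs would simply be the seed list the crawl begins from, while the rules decide which newly discovered links are kept for the next round. Swapping the + and - signs gives the reverse (exclude) behaviour.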

