Sounds great. It will be good for the web UI. If you write it, I can review it and add it to the wiki ;)
Talat

On Sep 26, 2014 8:26 AM, "Nima Falaki" <[email protected]> wrote:

> Nm, I figured it out. I had to run the webapp command. There should be a
> wiki to document this. I could volunteer to write one, if nobody else is
> going to do this?
>
> On Thu, Sep 25, 2014 at 10:23 PM, Nima Falaki <[email protected]> wrote:
>
> > Thanks Talat, just wondering, is there a set of instructions that I can
> > use to get the nutch web admin tool up and running?
> >
> > Nima
> >
> > On Thu, Sep 25, 2014 at 9:37 PM, Talat Uyarer <[email protected]> wrote:
> >
> >> Hi Nima,
> >>
> >> I have never used the nutch web admin. The web admin that you used is
> >> very old. Maybe you can use our brand new web admin development
> >> (https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841).
> >> It has just been committed to the trunk and 2.x branches.
> >>
> >> As for your question, IMHO start URLs means your seed list. Nutch
> >> accepts it as a folder or a text file.
> >>
> >> Limit URLs determine which newly discovered URLs are accepted for the
> >> next steps once the crawler starts from the seed list. You can write
> >> regex rules in nutch. For example, you crawl the home page of a news
> >> website but you only want to get sports URLs. You can write a regex
> >> rule for accepting sports URLs.
> >>
> >> For the reverse situation you can use Exclude URLs.
> >>
> >> Talat.
> >>
> >> Hello Everyone:
> >>
> >> I am following the directions exactly word for word in this tutorial
> >>
> >> https://github.com/101tec/nutch/wiki/admin-url-upload
> >>
> >> My question is what is the difference between the start and limit urls.
> >> From the wiki I saw that the limit url seems to be a flat list of urls
> >> we want to fetch, but then why have a start url to begin with?
> >>
> >> Also I noticed that when you do not have a limit url, you get the
> >> following exception (Note: I am using nutch-gui-0.5-dev).
> >>
> >> When you start a crawl, shouldn't there be some sort of pop-up box
> >> saying that you need a limit url, so you don't get this exception?
> >> I can help work on this.
> >>
> >> 14/09/25 20:34:37 INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done
> >> 14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:217) - bw update: starting
> >> 14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:218) - bw update: db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb
> >> 14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:219) - bw update: bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb
> >> 14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:220) - bw update: segments: [/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]
> >> 14/09/25 20:34:37 INFO [Thread-3071] (BWUpdateDb.java:223) - bw update: wrapping started.
> >> 14/09/25 20:34:37 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >> 14/09/25 20:34:38 INFO [Thread-3071] (BWUpdateDb.java:248) - bw update: filtering started.
> >> 14/09/25 20:34:38 WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >> 14/09/25 20:34:38 WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
> >> 14/09/25 20:34:38 WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
> >>
> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current
> >>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >>     at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >>     at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)
> >>     at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)
> >>     at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)
> >>     at java.lang.Thread.run(Thread.java:744)
> >>
> >> --
> >> Nima Falaki
> >> Software Engineer
> >> [email protected]
> >
> > --
> > Nima Falaki
> > Software Engineer
> > [email protected]
>
> --
> Nima Falaki
> Software Engineer
> [email protected]
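
To make the limit/exclude idea from Talat's reply concrete (accept only the sports URLs of a news site, reject everything else), here is a minimal sketch in Python. It is only an illustration, not Nutch code and not how the admin GUI is implemented. The host `www.example-news.com`, the rule list `RULES` and the function name `is_accepted` are all hypothetical, and the sketch assumes that rules are checked in order, that the first matching rule decides, and that a URL matching no rule is dropped.

```python
import re

# Hypothetical rules for illustration: (accept?, compiled pattern).
# Assumption: rules are checked top to bottom and the first match decides.
RULES = [
    # "Exclude URLs": reject static assets wherever they live.
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),
    # "Limit URLs": accept only the sports section of the (made-up) news site.
    (True, re.compile(r"^https?://www\.example-news\.com/sports/")),
]


def is_accepted(url: str) -> bool:
    """Return True if a newly discovered URL should be kept for the next step."""
    for accept, pattern in RULES:
        if pattern.search(url):
            return accept
    # Assumption: a URL that matches no rule is not crawled.
    return False


if __name__ == "__main__":
    discovered = [
        "http://www.example-news.com/sports/football.html",
        "http://www.example-news.com/politics/election.html",
        "http://www.example-news.com/sports/logo.png",
    ]
    for url in discovered:
        print(url, "->", "keep" if is_accepted(url) else "drop")
```

In this sketch the start URLs (the seed list) are only where the crawl begins; the rules are applied to the links found from there, which is why the two settings answer different questions.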

