Re: Question about Nutch Wicket

Talat Uyarer Thu, 25 Sep 2014 22:31:05 -0700

Sounds great. It will be good for the web UI. If you write it, I can review it
and add it to the wiki ;)


Talat
On Sep 26, 2014 8:26 AM, "Nima Falaki" <[email protected]> wrote:

> Never mind, I figured it out. I had to run the webapp command. There should
> be a wiki page to document this. I could volunteer to write one, if nobody
> else is going to do this?
>
>
>
> On Thu, Sep 25, 2014 at 10:23 PM, Nima Falaki <[email protected]>
> wrote:
>
> > Thanks Talat, just wondering: is there a set of instructions that I can use
> > to get the Nutch web admin tool up and running?
> >
> > Nima
> >
> > On Thu, Sep 25, 2014 at 9:37 PM, Talat Uyarer <[email protected]> wrote:
> >
> >> Hi Nima,
> >>
> >> I have never used the Nutch web admin. The web admin that you used is very
> >> old. Maybe you can use our brand new web admin, which is under development (
> >> https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-841).
> >> It has just been committed to trunk and the 2.x branch.
> >>
> >> As for your question, IMHO start URLs means your seed list. Nutch accepts
> >> it as a folder or a text file.
> >>
> >> Limit URLs determine which newly discovered URLs will be accepted for the
> >> next steps once the crawler has started from the seed list. Actually, you
> >> can write regex rules in Nutch. For example, you crawl the home page of a
> >> news site but only want to get the sports URLs. You can write a regex rule
> >> that accepts sports URLs.
> >>
> >> For the reverse situation, you can use Exclude URLs.
> >>
> >> Talat.
> >>  Hello Everyone:
> >>
> >> I am following the directions exactly word for word in this tutorial
> >>
> >> https://github.com/101tec/nutch/wiki/admin-url-upload
> >>
> >> My question is: what is the difference between the start and limit URLs?
> >> From the wiki I saw that the limit URL seems to be a flat list of URLs we
> >> want to fetch, but then why have a start URL to begin with?
> >>
> >> Also, I noticed that when you do not have a limit URL, you get the
> >> following exception (note: I am using nutch-gui-0.5-dev).
> >>
> >> When you start a crawl, shouldn't a pop-up box appear saying that you need
> >> a limit URL, so that you don't get this exception? I can help work on this.
> >>
> >>
> >> 14/09/25 20:34:37  INFO [Thread-3071] (Fetcher.java:970) - Fetcher: done
> >> 14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:217) - bw update: starting
> >> 14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:218) - bw update: db: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/crawldb
> >> 14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:219) - bw update: bwdb: /private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb
> >> 14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:220) - bw update: segments: [/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/segments/20140925203433]
> >> 14/09/25 20:34:37  INFO [Thread-3071] (BWUpdateDb.java:223) - bw update: wrapping started.
> >> 14/09/25 20:34:37  WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >> 14/09/25 20:34:38  INFO [Thread-3071] (BWUpdateDb.java:248) - bw update: filtering started.
> >> 14/09/25 20:34:38  WARN [Thread-3071] (JobClient.java:547) - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> >> 14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
> >> 14/09/25 20:34:38  WARN [Thread-3071] (StartCrawlRunnable.java:57) - can not start crawl.
> >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/private/tmp/nutch/Nima/crawls/Crawl-2014.09.25_20.34.20/bwdb/current
> >> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:179)
> >> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:39)
> >> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:190)
> >> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:797)
> >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1142)
> >> at org.apache.nutch.crawl.bw.BWUpdateDb.update(BWUpdateDb.java:263)
> >> at org.apache.nutch.crawl.CrawlTool.crawl(CrawlTool.java:135)
> >> at org.apache.nutch.admin.crawl.StartCrawlRunnable.run(StartCrawlRunnable.java:52)
> >> at java.lang.Thread.run(Thread.java:744)
> >>
> >>
> >> --
> >>
> >>
> >>
> >> Nima Falaki
> >> Software Engineer
> >> [email protected]
> >>
>
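
The difference between start, limit, and exclude URLs discussed in the quoted thread can be illustrated with a minimal Python sketch. It is not the nutch-gui implementation; it only imitates the convention of Nutch's regex-urlfilter.txt, where a rule prefixed with + accepts, a rule prefixed with - rejects, the first matching rule decides, and a URL matching no rule is ignored. The host news.example.com, the RULES list, and the accept() helper are assumptions made up for illustration.

```python
import re

# Hypothetical rules in the spirit of regex-urlfilter.txt:
# '+' accepts, '-' rejects, and the first matching rule wins.
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),                # exclude static assets
    ("+", re.compile(r"^https?://news\.example\.com/sports/")),   # limit to sports pages
]


def accept(url: str) -> bool:
    """Return True if the first matching rule is '+'.

    URLs that match no rule at all are rejected.
    """
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False


# Start URLs (the seed list): where the crawl begins.
seeds = ["http://news.example.com/"]

# Links the crawler might discover on the seed page (made-up examples).
discovered = [
    "http://news.example.com/sports/match-report.html",
    "http://news.example.com/politics/election.html",
    "http://news.example.com/sports/logo.png",
]

# Only URLs that pass the filter are kept for the next crawl round.
to_fetch = [url for url in discovered if accept(url)]
print(to_fetch)
```

In this sketch the seed list only says where to begin, while the rules decide which of the newly found links survive into the next round. Putting the exclude rule first means it takes precedence over the sports rule for image files.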
