You will need to run Nutch several times in order to fetch everything. If you have one URL in your seed.txt, the first run will only fetch and index ONE page (e.g. the index.html of that URL), then parse that page and add all the links it finds to the database. The next run will fetch the links found in the first run, the third run will fetch the links found in the second run, and so forth...
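One way around re-running by hand is to pass a positive number of crawl rounds as the last argument to the crawl script, so it performs the generate/fetch/parse/update cycle that many times in one invocation. A sketch based on the command quoted below (the Solr URL, core name, and seed directory are from this thread; adjust them to your setup, and note that -1 is not a valid round count here):

```shell
# Run 5 crawl rounds in one go: each round fetches the links
# discovered in the previous round, then indexes into Solr.
./bin/crawl -i \
  -D solr.server.url=http://localhost:8983/solr/TEST_CORE \
  urls/ crawl 5
```

Pick the number of rounds to match how many link levels deep your site goes from the seed pages.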
Have a great weekend everyone!

On Fri, Sep 9, 2016 at 9:05 PM, Comcast <[email protected]> wrote:
> Tried that. Same result
>
> Sent from my iPhone
>
>> On Sep 9, 2016, at 3:04 PM, BlackIce <[email protected]> wrote:
>>
>> Change the -1 to a positive number like 5 or so.... (In the command)
>>
>>> On Sep 9, 2016 8:20 PM, "KRIS MUSSHORN" <[email protected]> wrote:
>>>
>>> Executing this does NOT index everything in and under seed.txt.
>>>
>>> ./bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TEST_CORE urls/ crawl -1
>>>
>>> I have to run it multiple times to get all content.
>>>
>>> Is it possible related to this setting in nutch-site.xml?
>>>
>>> <property>
>>>   <name>db.max.outlinks.per.page</name>
>>>   <value>-1</value>
>>>   <description>
>>>     allow unlimited outlinks with -1
>>>   </description>
>>> </property>
>>>
>>> Thx,
>>>
>>> Kris

