Hi Patrick,

On Sat, Sep 28, 2013 at 10:10 PM, <[email protected]> wrote:
> 1. I use this command to start the crawling, as stated in the tutorial:
>
>   /bin/bash ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/2
>
> So when will the crawled pages be sent to Solr for indexing?

If you look into the crawl script, you will see that it executes its tasks
sequentially. Indexing is one of the later tasks and is executed by default.

> When I look at the Solr dashboard, the number of docs is not increasing
> while the crawl is in progress.

This is fine. It hopefully just means that the indexing command has not been
executed yet.

> 2. About error handling. If some Java exceptions are thrown in the middle
> of crawling, how can I know whether the crawled data are indexed, and where
> will the crawling resume if I execute the above command?

You can and should always read the contents of your crawldb to check that the
URLs you want to crawl are contained there and that they are being parsed and
processed. We try our best to fail some operations gracefully and continue
crawling; on the other hand, tasks which are given incorrect parameters as
input should fail fast and stop the crawling process, which is nice. Indexing
is one example.

> 3. Any advice on running the crawl if I want to index frequently updated
> pages, e.g. BBC News?

There are numerous threads on this topic, spread over a long period of time,
on our user@ lists. You should take a bit of time to read through the
relevant threads; it will save you loads of time in the long run.

hth
Lewis
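
P.S. To see what the crawl has actually recorded, you can read the crawldb
with the readdb tool. This is only a sketch: it assumes a Nutch 1.x install
run from the Nutch home directory, and that the crawl script wrote its data
under TestCrawl/ (the crawl ID you passed on the command line).

  # overall counts per CrawlDatum status (fetched, unfetched, gone, ...)
  bin/nutch readdb TestCrawl/crawldb -stats

  # status of a single URL; replace the example URL with one from your seed list
  bin/nutch readdb TestCrawl/crawldb -url http://example.com/

  # dump the whole crawldb as plain text for closer inspection
  bin/nutch readdb TestCrawl/crawldb -dump TestCrawl/crawldb-dump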
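
And if a run stops before the indexing step, you can run that step by hand
once the problem is fixed. Again only a sketch, assuming Nutch 1.x (where the
Solr indexer is invoked as solrindex), the same TestCrawl/ layout, and that
http://localhost:8983/solr/ points at the Solr core you index into:

  # index the already fetched and parsed segments into Solr
  bin/nutch solrindex http://localhost:8983/solr/ TestCrawl/crawldb \
      -linkdb TestCrawl/linkdb TestCrawl/segments/*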

