Hi Patrick,

On Sat, Sep 28, 2013 at 10:10 PM, <[email protected]> wrote:
> 1. I use this command to start the crawling, as stated in the tutorial:
>
>   /bin/bash ./bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr/2
>
> So when will the crawled pages be sent to Solr for indexing?

If you look into the crawl script, you will see that it executes its tasks
sequentially. Indexing is one of the later tasks and is executed by default.

> When I look at the Solr dashboard, the number of docs is not increasing
> while the crawl is in progress.

This is fine. It hopefully just means that the indexing command has not been
executed yet.

> 2. About error handling. If some Java exceptions are thrown in the middle
> of crawling, how can I know whether the crawled data are indexed, and where
> will the crawling resume if I execute the above command?

You can and should always read the contents of your crawldb to check that the
URLs you want to crawl are contained there and that they are being parsed and
processed. We try our best to fail some operations gracefully and continue
crawling; on the other hand, tasks which are given incorrect parameters as
input should fail fast and stop the crawling process, which is nice. Indexing
is one example.

> 3. Any advice on running the crawl if I want to index frequently updated
> pages, e.g. BBC News?

There are numerous threads on this topic, spread over a long period of time,
on our user@ lists. You should take a bit of time to read through the
relevant threads; it will save you loads of time in the long run.

hth
Lewis
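
P.S. To see what the crawl has actually recorded, you can read the crawldb
with the readdb tool. This is only a sketch: it assumes a Nutch 1.x install
run from the Nutch home directory, and that the crawl script wrote its data
under TestCrawl/ (the crawl ID you passed on the command line).

  # overall counts per CrawlDatum status (fetched, unfetched, gone, ...)
  bin/nutch readdb TestCrawl/crawldb -stats

  # status of a single URL; replace the example URL with one from your seed list
  bin/nutch readdb TestCrawl/crawldb -url http://example.com/

  # dump the whole crawldb as plain text for closer inspection
  bin/nutch readdb TestCrawl/crawldb -dump TestCrawl/crawldb-dump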
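
And if a run stops before the indexing step, you can run that step by hand
once the problem is fixed. Again only a sketch, assuming Nutch 1.x (where the
Solr indexer is invoked as solrindex), the same TestCrawl/ layout, and that
http://localhost:8983/solr/ points at the Solr core you index into:

  # index the already fetched and parsed segments into Solr
  bin/nutch solrindex http://localhost:8983/solr/ TestCrawl/crawldb \
      -linkdb TestCrawl/linkdb TestCrawl/segments/*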

