Hi, after a second look: the Solr error only affects the cleaning job. After checking the logs carefully:
- only one page is fetched 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - fetching https://dev-abc.com/letters (queue crawl delay=30000ms) - and one page is sent as deletion (probably a 404) to the indexer 2018-07-20 09:46:23,769 INFO solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents But given only the logs I don't see a way to find out why the page failed to fetch. The CrawlDb contains the fetch status and usually also a status message which explains the failure. Best, Sebastian On 07/23/2018 04:15 PM, Rushi wrote: > Hi Sebastian, > I am using Solr 6.4.2.But i am surprised with the same configuration Nutch > 1.13 and Solr 6.4.2 crawling/indexing with Prod urls seems to be working > fine without any issues. > > On Mon, Jul 23, 2018 at 7:37 AM Sebastian Nagel > <wastl.na...@googlemail.com.invalid> wrote: > >> Hi, >> >> there is an exception "Connection pool shut down". >> Which version of Solr are you running? It should be >> Solr 5.5.0 for Nutch 1.13. >> >> Sebastian >> >> On 07/20/2018 03:58 PM, Rushi wrote: >>> Thanks for the response Sebastian, >>> Yeah i changed my seeds and i am using Nutch 1.13 >>> >>> Here is the log >>> 2018-07-20 09:45:49,769 INFO crawl.Injector - Injector: starting at >>> 2018-07-20 09:45:49 >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: crawlDb: >>> TestCra7sl/crawldb >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: urlDir: urls >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: Converting >>> injected urls to crawl db entries. >>> 2018-07-20 09:45:49,894 WARN util.NativeCodeLoader - Unable to load >>> native-hadoop library for your platform... 
using builtin-java classes >> where >>> applicable >>> 2018-07-20 09:45:51,672 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: >>> /nutch/plugins/plugin/plugin.xml (No such file or directory) >>> 2018-07-20 09:45:51,688 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/publish-rabitmq/plugin.xml` >>> java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml >>> (No such file or directory) >>> 2018-07-20 09:45:51,759 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/parse-replace/plugin.xml` >>> java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml >> (No >>> such file or directory) >>> 2018-07-20 09:45:51,839 INFO regex.RegexURLNormalizer - can't find rules >>> for scope 'inject', using default >>> 2018-07-20 09:45:51,985 INFO crawl.Injector - Injector: overwrite: false >>> 2018-07-20 09:45:51,985 INFO crawl.Injector - Injector: update: false >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls >>> rejected by filters: 0 >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls >>> injected after normalization and filtering: 1 >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls >>> injected but already in CrawlDb: 0 >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total new urls >>> injected: 1 >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: finished at >>> 2018-07-20 09:45:52, elapsed: 00:00:02 >>> 2018-07-20 09:45:53,235 WARN util.NativeCodeLoader - Unable to load >>> native-hadoop library for your platform... using builtin-java classes >> where >>> applicable >>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: starting at >>> 2018-07-20 09:45:53 >>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: Selecting >>> best-scoring urls due for fetch. 
>>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: filtering: >> false >>> 2018-07-20 09:45:53,375 INFO crawl.Generator - Generator: normalizing: >> true >>> 2018-07-20 09:45:53,375 INFO crawl.Generator - Generator: topN: 50000 >>> 2018-07-20 09:45:54,084 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: >>> /nutch/plugins/plugin/plugin.xml (No such file or directory) >>> 2018-07-20 09:45:54,088 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/publish-rabitmq/plugin.xml` >>> java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml >>> (No such file or directory) >>> 2018-07-20 09:45:54,109 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/parse-replace/plugin.xml` >>> java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml >> (No >>> such file or directory) >>> 2018-07-20 09:45:54,146 INFO crawl.FetchScheduleFactory - Using >>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule >>> 2018-07-20 09:45:54,147 INFO crawl.AbstractFetchSchedule - >>> defaultInterval=2592000 >>> 2018-07-20 09:45:54,147 INFO crawl.AbstractFetchSchedule - >>> maxInterval=7776000 >>> 2018-07-20 09:45:54,154 INFO regex.RegexURLNormalizer - can't find rules >>> for scope 'partition', using default >>> 2018-07-20 09:45:54,233 INFO crawl.FetchScheduleFactory - Using >>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule >>> 2018-07-20 09:45:54,233 INFO crawl.AbstractFetchSchedule - >>> defaultInterval=2592000 >>> 2018-07-20 09:45:54,233 INFO crawl.AbstractFetchSchedule - >>> maxInterval=7776000 >>> 2018-07-20 09:45:54,243 INFO crawl.FetchScheduleFactory - Using >>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule >>> 2018-07-20 09:45:54,243 INFO crawl.AbstractFetchSchedule - >>> defaultInterval=2592000 >>> 2018-07-20 09:45:54,243 INFO crawl.AbstractFetchSchedule - >>> 
maxInterval=7776000 >>> 2018-07-20 09:45:54,244 INFO regex.RegexURLNormalizer - can't find rules >>> for scope 'generate_host_count', using default >>> 2018-07-20 09:45:54,915 INFO crawl.Generator - Generator: Partitioning >>> selected urls for politeness. >>> 2018-07-20 09:45:55,916 INFO crawl.Generator - Generator: segment: >>> TestCra7sl/segments/20180720094555 >>> 2018-07-20 09:45:57,096 INFO crawl.Generator - Generator: finished at >>> 2018-07-20 09:45:57, elapsed: 00:00:03 >>> 2018-07-20 09:45:57,928 INFO fetcher.Fetcher - Fetcher: starting at >>> 2018-07-20 09:45:57 >>> 2018-07-20 09:45:57,929 INFO fetcher.Fetcher - Fetcher: segment: >>> TestCra7sl/segments/20180720094555 >>> 2018-07-20 09:45:57,929 INFO fetcher.Fetcher - Fetcher Timelimit set >> for : >>> 1532105157929 >>> 2018-07-20 09:45:58,073 WARN util.NativeCodeLoader - Unable to load >>> native-hadoop library for your platform... using builtin-java classes >> where >>> applicable >>> 2018-07-20 09:45:58,800 INFO fetcher.FetchItemQueues - Using queue mode >> : >>> byHost >>> 2018-07-20 09:45:58,800 INFO fetcher.Fetcher - Fetcher: threads: 50 >>> 2018-07-20 09:45:58,800 INFO fetcher.Fetcher - Fetcher: time-out >> divisor: 2 >>> 2018-07-20 09:45:58,804 INFO fetcher.QueueFeeder - QueueFeeder finished: >>> total 1 records + hit by time limit :0 >>> 2018-07-20 09:45:58,852 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: >>> /nutch/plugins/plugin/plugin.xml (No such file or directory) >>> 2018-07-20 09:45:58,855 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/publish-rabitmq/plugin.xml` >>> java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml >>> (No such file or directory) >>> 2018-07-20 09:45:58,875 WARN plugin.PluginRepository - Error while >> loading >>> plugin `/nutch/plugins/parse-replace/plugin.xml` >>> java.io.FileNotFoundException: 
/nutch/plugins/parse-replace/plugin.xml >> (No >>> such file or directory) >>> 2018-07-20 09:45:58,901 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,917 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,917 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - fetching >>> https://dev-abc.com/letters (queue crawl delay=30000ms) >>> 2018-07-20 09:45:58,918 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,918 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,919 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 
09:45:58,919 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,919 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,919 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,920 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,920 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,920 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more 
work available >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,920 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,920 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,921 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,921 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,921 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,921 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 
2018-07-20 09:45:58,922 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,922 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,922 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,923 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,923 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,923 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> 
has no more work available >>> 2018-07-20 09:45:58,923 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,924 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,924 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,924 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,925 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,925 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Using queue mode : >>> 
byHost >>> 2018-07-20 09:45:58,925 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,925 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,926 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,926 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,926 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,926 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,927 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Thread >> 
FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,927 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,927 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,927 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,928 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,928 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,928 INFO fetcher.FetcherThread - Using 
queue mode : >>> byHost >>> 2018-07-20 09:45:58,928 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,929 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,929 INFO protocol.RobotRulesParser - robots.txt >>> whitelist not configured. >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,929 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.host = null >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.port = 8080 >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.exception.list = >> false >>> 2018-07-20 09:45:58,929 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,929 INFO http.Http - http.timeout = 10000 >>> 2018-07-20 09:45:58,930 INFO http.Http - http.content.limit = -1 >>> 2018-07-20 09:45:58,930 INFO http.Http - http.agent = >>> nutch-solr-integration/Nutch-1.13-SNAPSHOT >>> 2018-07-20 09:45:58,930 INFO http.Http - http.accept.language = >>> 
en-us,en-gb,en;q=0.7,*;q=0.3 >>> 2018-07-20 09:45:58,930 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,936 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,936 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,936 INFO http.Http - http.accept = >>> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 >>> 2018-07-20 09:45:58,936 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,936 INFO http.Http - http.enable.cookie.header = >> true >>> 2018-07-20 09:45:58,936 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,936 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,936 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,937 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,937 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Using queue 
mode : >>> byHost >>> 2018-07-20 09:45:58,937 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,937 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,938 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,938 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,938 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,938 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,939 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - Thread 
>> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,939 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,939 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,939 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,940 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Using queue mode : >>> byHost >>> 2018-07-20 09:45:58,940 INFO net.URLExemptionFilters - Found 0 >> extensions >>> at point:'org.apache.nutch.net.URLExemptionFilter' >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Thread >> FetcherThread >>> has no more work available >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - -finishing thread >>> FetcherThread, activeThreads=1 >>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - 
>>> Using queue mode : byHost
>>> 2018-07-20 09:45:58,940 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
>>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
>>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=1
>>> 2018-07-20 09:45:58,940 INFO fetcher.FetcherThread - Using queue mode : byHost
>>> 2018-07-20 09:45:58,941 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
>>> 2018-07-20 09:45:58,941 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
>>> 2018-07-20 09:45:58,941 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=1
>>> 2018-07-20 09:45:58,941 INFO fetcher.FetcherThread - Using queue mode : byHost
>>> 2018-07-20 09:45:58,941 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
>>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
>>> 2018-07-20 09:45:58,941 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=1
>>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
>>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - fetcher.maxNum.threads can't be < than 50 : using 50 instead
>>> 2018-07-20 09:45:59,945 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>>> 2018-07-20 09:46:00,951 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>>> 2018-07-20 09:46:01,955 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>>> 2018-07-20 09:46:02,957 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>>> 2018-07-20 09:46:03,959 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
>>> 2018-07-20 09:46:04,756 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
>>> 2018-07-20 09:46:04,756 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=0
>>> 2018-07-20 09:46:04,964 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
>>> 2018-07-20 09:46:04,964 INFO fetcher.Fetcher - -activeThreads=0
>>> 2018-07-20 09:46:05,709 INFO fetcher.Fetcher - Fetcher: finished at 2018-07-20 09:46:05, elapsed: 00:00:07
>>> 2018-07-20 09:46:06,597 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:06,712 INFO parse.ParseSegment - ParseSegment: starting at 2018-07-20 09:46:06
>>> 2018-07-20 09:46:06,713 INFO parse.ParseSegment - ParseSegment: segment: TestCra7sl/segments/20180720094555
>>> 2018-07-20 09:46:07,478 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:07,482 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:07,502 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:07,690 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
>>> 2018-07-20 09:46:07,775 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
>>> 2018-07-20 09:46:08,294 INFO parse.ParseSegment - ParseSegment: finished at 2018-07-20 09:46:08, elapsed: 00:00:01
>>> 2018-07-20 09:46:09,234 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: starting at 2018-07-20 09:46:09
>>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: db: TestCra7sl/crawldb
>>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: segments: [TestCra7sl/segments/20180720094555]
>>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: additions allowed: false
>>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: URL normalizing: false
>>> 2018-07-20 09:46:09,392 INFO crawl.CrawlDb - CrawlDb update: URL filtering: false
>>> 2018-07-20 09:46:09,392 INFO crawl.CrawlDb - CrawlDb update: 404 purging: false
>>> 2018-07-20 09:46:09,393 INFO crawl.CrawlDb - CrawlDb update: Merging segment data into db.
>>> 2018-07-20 09:46:10,518 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:10,522 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:10,541 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:10,567 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2018-07-20 09:46:10,567 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>> 2018-07-20 09:46:10,567 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>> 2018-07-20 09:46:10,616 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
>>> 2018-07-20 09:46:10,616 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
>>> 2018-07-20 09:46:10,616 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
>>> 2018-07-20 09:46:11,005 INFO crawl.CrawlDb - CrawlDb update: finished at 2018-07-20 09:46:11, elapsed: 00:00:01
>>> 2018-07-20 09:46:11,980 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: starting at 2018-07-20 09:46:12
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: linkdb: TestCra7sl/linkdb
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: URL normalize: true
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: URL filter: true
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: internal links will be ignored.
>>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: adding segment: TestCra7sl/segments/20180720094555
>>> 2018-07-20 09:46:12,922 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:12,926 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:12,946 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:13,726 INFO crawl.LinkDb - LinkDb: finished at 2018-07-20 09:46:13, elapsed: 00:00:01
>>> 2018-07-20 09:46:14,596 INFO crawl.DeduplicationJob - DeduplicationJob: starting at 2018-07-20 09:46:14
>>> 2018-07-20 09:46:14,785 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:16,448 INFO crawl.DeduplicationJob - Deduplication: 0 documents marked as duplicates
>>> 2018-07-20 09:46:16,449 INFO crawl.DeduplicationJob - Deduplication: Updating status of duplicate urls into crawl db.
>>> 2018-07-20 09:46:17,665 INFO crawl.DeduplicationJob - Deduplication finished at 2018-07-20 09:46:17, elapsed: 00:00:03
>>> 2018-07-20 09:46:18,637 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:18,776 INFO segment.SegmentChecker - Segment dir is complete: TestCra7sl/segments/20180720094555.
>>> 2018-07-20 09:46:18,777 INFO indexer.IndexingJob - Indexer: starting at 2018-07-20 09:46:18
>>> 2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>> 2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: URL filtering: false
>>> 2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: URL normalizing: false
>>> 2018-07-20 09:46:18,848 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:18,853 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:18,877 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:18,947 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2018-07-20 09:46:18,947 INFO indexer.IndexingJob - Active IndexWriters :
>>> SOLRIndexWriter
>>> solr.server.url : URL of the SOLR instance
>>> solr.zookeeper.hosts : URL of the Zookeeper quorum
>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>> solr.auth : use authentication (default false)
>>> solr.auth.username : username for authentication
>>> solr.auth.password : password for authentication
>>>
>>>
>>> 2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCra7sl/crawldb
>>> 2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCra7sl/linkdb
>>> 2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCra7sl/segments/20180720094555
>>> 2018-07-20 09:46:19,781 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2018-07-20 09:46:20,716 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: content dest: content
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: title dest: title
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.description dest: description
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.section dest: section
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: host dest: host
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: content dest: content
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: title dest: title
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.description dest: description
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.section dest: section
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: host dest: host
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
>>> 2018-07-20 09:46:21,646 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
>>> 2018-07-20 09:46:21,652 INFO indexer.IndexingJob - Indexer: finished at 2018-07-20 09:46:21, elapsed: 00:00:02
>>> 2018-07-20 09:46:22,551 INFO indexer.CleaningJob - CleaningJob: starting at 2018-07-20 09:46:22
>>> 2018-07-20 09:46:22,717 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-07-20 09:46:23,395 WARN output.FileOutputCommitter - Output Path is null in setupJob()
>>> 2018-07-20 09:46:23,644 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:23,649 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:23,669 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
>>> 2018-07-20 09:46:23,702 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: content dest: content
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: title dest: title
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.description dest: description
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.section dest: section
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: host dest: host
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: segment dest: segment
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: boost dest: boost
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: digest dest: digest
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>> 2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
>>> 2018-07-20 09:46:23,769 INFO solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
>>> 2018-07-20 09:46:23,909 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
>>> 2018-07-20 09:46:23,910 WARN mapred.LocalJobRunner - job_local1584437722_0001
>>> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
>>> Caused by: java.lang.IllegalStateException: Connection pool shut down
>>> at org.apache.http.util.Asserts.check(Asserts.java:34)
>>> at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
>>> at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
>>> at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
>>> at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
>>> at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>>> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>>> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>>> at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
>>> at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
>>> at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
>>> at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
>>> at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
>>> at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
>>> at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
>>> at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
>>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>> at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> at java.lang.Thread.run(Thread.java:748)
>>> 2018-07-20 09:46:24,406 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
>>> at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
>>> at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>> at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)
>>>
>>> On Fri, Jul 20, 2018 at 2:06 AM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:
>>>
>>>> Hi,
>>>>
>>>>> * Changed my regex-filter to use development domain address.
>>>>
>>>> Did you also change your seeds?
>>>>
>>>> The fact that deletions are sent but not additions/updates
>>>> suggests that no pages have been successfully crawled.
>>>>
>>>> Could you specify the Nutch version used and also attach some
>>>> log snippets to make it possible to analyze the issue.
>>>>
>>>> Thanks,
>>>> Sebastian
>>>>
>>>> On 07/19/2018 10:30 PM, Rushi wrote:
>>>>> Hi all,
>>>>> I was using nutch from last 6 months and it works with Production urls with out any issue and for testing purpose i want make this work on Dev/staging. I followed these steps
>>>>>
>>>>> And ran this command
>>>>>
>>>>> ./bin/crawl -i -D solr.server.url=http://dev.abc.com:8983/solr/automateindex/ urls/ Testcrawl 2
>>>>>
>>>>> I dont see that it is indexed but it shows that it is deleted.
>>>>>
>>>>> nutch_—_-bash_—_206×67.png
>>>>>
>>>>> Note: I tried checking the Dev url with bin/nutch indexchecker https://dev-abc.com/letters it shows me the content.
>>>>>
>>>>> I would really appreciate suggest me a solution.