FYI, I finally figured out why it was not indexing. Here is the solution: since the website is internal, we do not allow crawlers, so I went to the nutch-site.xml file and added the following property:
<property>
  <name>http.robot.rules.whitelist</name>
  <value>dev-abc.com</value>
</property>

Then I ran the indexer and it works now.

On Mon, Jul 23, 2018 at 10:46 AM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi,
>
> after a second look: the Solr error only affects the cleaning job.
> After checking the logs carefully:
>
> - only one page is fetched
>   2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - fetching
>   https://dev-abc.com/letters (queue crawl delay=30000ms)
>
> - and one page is sent as a deletion (probably a 404) to the indexer
>   2018-07-20 09:46:23,769 INFO solr.SolrIndexWriter - SolrIndexer: deleting
>   1/1 documents
>
> But given only the logs I don't see a way to find out why the
> page failed to fetch. The CrawlDb contains the fetch status and
> usually also a status message which explains the failure.
>
> Best,
> Sebastian
>
> On 07/23/2018 04:15 PM, Rushi wrote:
> > Hi Sebastian,
> > I am using Solr 6.4.2. But I am surprised: with the same configuration,
> > Nutch 1.13 and Solr 6.4.2 crawling/indexing the prod URLs seems to be
> > working fine without any issues.
> >
> > On Mon, Jul 23, 2018 at 7:37 AM Sebastian Nagel
> > <wastl.na...@googlemail.com.invalid> wrote:
> >
> >> Hi,
> >>
> >> there is an exception "Connection pool shut down".
> >> Which version of Solr are you running? It should be
> >> Solr 5.5.0 for Nutch 1.13.
> >>
> >> Sebastian
> >>
> >> On 07/20/2018 03:58 PM, Rushi wrote:
> >>> Thanks for the response, Sebastian.
> >>> Yeah, I changed my seeds and I am using Nutch 1.13.
> >>>
> >>> Here is the log:
> >>> 2018-07-20 09:45:49,769 INFO crawl.Injector - Injector: starting at 2018-07-20 09:45:49
> >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: crawlDb: TestCra7sl/crawldb
> >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: urlDir: urls
> >>> 2018-07-20 09:45:49,770 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
> >>> 2018-07-20 09:45:49,894 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>> 2018-07-20 09:45:51,672 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:51,688 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:51,759 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:51,839 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> >>> 2018-07-20 09:45:51,985 INFO crawl.Injector - Injector: overwrite: false
> >>> 2018-07-20 09:45:51,985 INFO crawl.Injector - Injector: update: false
> >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls rejected by filters: 0
> >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls injected after normalization and filtering: 1
> >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
> >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: Total new urls injected: 1
> >>> 2018-07-20 09:45:52,330 INFO crawl.Injector - Injector: finished at 2018-07-20 09:45:52, elapsed: 00:00:02
> >>> 2018-07-20 09:45:53,235 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: starting at 2018-07-20 09:45:53
> >>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
> >>> 2018-07-20 09:45:53,374 INFO crawl.Generator - Generator: filtering: false
> >>> 2018-07-20 09:45:53,375 INFO crawl.Generator - Generator: normalizing: true
> >>> 2018-07-20 09:45:53,375 INFO crawl.Generator - Generator: topN: 50000
> >>> 2018-07-20 09:45:54,084 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:54,088 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:54,109 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:54,146 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>> 2018-07-20 09:45:54,147 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >>> 2018-07-20 09:45:54,147 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >>> 2018-07-20 09:45:54,154 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
> >>> 2018-07-20 09:45:54,233 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>> 2018-07-20 09:45:54,233 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >>> 2018-07-20 09:45:54,233 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >>> 2018-07-20 09:45:54,243 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> >>> 2018-07-20 09:45:54,243 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
> >>> 2018-07-20 09:45:54,243 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
> >>> 2018-07-20 09:45:54,244 INFO regex.RegexURLNormalizer - can't find rules for scope 'generate_host_count', using default
> >>> 2018-07-20 09:45:54,915 INFO crawl.Generator - Generator: Partitioning selected urls for politeness.
> >>> 2018-07-20 09:45:55,916 INFO crawl.Generator - Generator: segment: TestCra7sl/segments/20180720094555
> >>> 2018-07-20 09:45:57,096 INFO crawl.Generator - Generator: finished at 2018-07-20 09:45:57, elapsed: 00:00:03
> >>> 2018-07-20 09:45:57,928 INFO fetcher.Fetcher - Fetcher: starting at 2018-07-20 09:45:57
> >>> 2018-07-20 09:45:57,929 INFO fetcher.Fetcher - Fetcher: segment: TestCra7sl/segments/20180720094555
> >>> 2018-07-20 09:45:57,929 INFO fetcher.Fetcher - Fetcher Timelimit set for : 1532105157929
> >>> 2018-07-20 09:45:58,073 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>> 2018-07-20 09:45:58,800 INFO fetcher.FetchItemQueues - Using queue mode : byHost
> >>> 2018-07-20 09:45:58,800 INFO fetcher.Fetcher - Fetcher: threads: 50
> >>> 2018-07-20 09:45:58,800 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> >>> 2018-07-20 09:45:58,804 INFO fetcher.QueueFeeder - QueueFeeder finished: total 1 records + hit by time limit :0
> >>> 2018-07-20 09:45:58,852 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:58,855 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:58,875 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:45:58,901 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> >>> 2018-07-20 09:45:58,917 INFO fetcher.FetcherThread - Using queue mode : byHost
> >>> 2018-07-20 09:45:58,917 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Using queue mode : byHost
> >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - fetching https://dev-abc.com/letters (queue crawl delay=30000ms)
> >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
> >>> 2018-07-20 09:45:58,918 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=1
> >>> [the "Using queue mode : byHost", "Found 0 extensions", "no more work available", and "-finishing thread" lines repeat for each of the remaining idle FetcherThreads; repetitions elided]
> >>> 2018-07-20 09:45:58,929 INFO protocol.RobotRulesParser - robots.txt whitelist not configured.
> >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.host = null
> >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.port = 8080
> >>> 2018-07-20 09:45:58,929 INFO http.Http - http.proxy.exception.list = false
> >>> 2018-07-20 09:45:58,929 INFO http.Http - http.timeout = 10000
> >>> 2018-07-20 09:45:58,930 INFO http.Http - http.content.limit = -1
> >>> 2018-07-20 09:45:58,930 INFO http.Http - http.agent = nutch-solr-integration/Nutch-1.13-SNAPSHOT
> >>> 2018-07-20 09:45:58,930 INFO http.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3
> >>> 2018-07-20 09:45:58,936 INFO http.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> >>> 2018-07-20 09:45:58,936 INFO http.Http - http.enable.cookie.header = true
> >>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - Fetcher: throughput threshold: -1
> >>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - Fetcher: throughput threshold retries: 5
> >>> 2018-07-20 09:45:58,941 INFO fetcher.Fetcher - fetcher.maxNum.threads can't be < than 50 : using 50 instead
> >>> 2018-07-20 09:45:59,945 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >>> 2018-07-20 09:46:00,951 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >>> 2018-07-20 09:46:01,955 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >>> 2018-07-20 09:46:02,957 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >>> 2018-07-20 09:46:03,959 INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=1
> >>> 2018-07-20 09:46:04,756 INFO fetcher.FetcherThread - Thread FetcherThread has no more work available
> >>> 2018-07-20 09:46:04,756 INFO fetcher.FetcherThread - -finishing thread FetcherThread, activeThreads=0
> >>> 2018-07-20 09:46:04,964 INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
> >>> 2018-07-20 09:46:04,964 INFO fetcher.Fetcher - -activeThreads=0
> >>> 2018-07-20 09:46:05,709 INFO fetcher.Fetcher - Fetcher: finished at 2018-07-20 09:46:05, elapsed: 00:00:07
> >>> 2018-07-20 09:46:06,597 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> >>> 2018-07-20 09:46:06,712 INFO parse.ParseSegment - ParseSegment: starting at 2018-07-20 09:46:06
> >>> 2018-07-20 09:46:06,713 INFO parse.ParseSegment - ParseSegment: segment: TestCra7sl/segments/20180720094555
> >>> 2018-07-20 09:46:07,478 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:46:07,482 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:46:07,502 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
> >>> 2018-07-20 09:46:07,690 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> >>> 2018-07-20 09:46:07,775 INFO net.URLExemptionFilters - Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'
> >>> 2018-07-20 09:46:08,294 INFO parse.ParseSegment - ParseSegment: finished at 2018-07-20 09:46:08, elapsed: 00:00:01
> >>> 2018-07-20 09:46:09,234 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform...
using builtin-java classes > >> where > >>> applicable > >>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: starting > at > >>> 2018-07-20 09:46:09 > >>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: db: > >>> TestCra7sl/crawldb > >>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: segments: > >>> [TestCra7sl/segments/20180720094555] > >>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: additions > >>> allowed: false > >>> 2018-07-20 09:46:09,391 INFO crawl.CrawlDb - CrawlDb update: URL > >>> normalizing: false > >>> 2018-07-20 09:46:09,392 INFO crawl.CrawlDb - CrawlDb update: URL > >>> filtering: false > >>> 2018-07-20 09:46:09,392 INFO crawl.CrawlDb - CrawlDb update: 404 > >> purging: > >>> false > >>> 2018-07-20 09:46:09,393 INFO crawl.CrawlDb - CrawlDb update: Merging > >>> segment data into db. > >>> 2018-07-20 09:46:10,518 WARN plugin.PluginRepository - Error while > >> loading > >>> plugin `/nutch/plugins/plugin/plugin.xml` > java.io.FileNotFoundException: > >>> /nutch/plugins/plugin/plugin.xml (No such file or directory) > >>> 2018-07-20 09:46:10,522 WARN plugin.PluginRepository - Error while > >> loading > >>> plugin `/nutch/plugins/publish-rabitmq/plugin.xml` > >>> java.io.FileNotFoundException: > /nutch/plugins/publish-rabitmq/plugin.xml > >>> (No such file or directory) > >>> 2018-07-20 09:46:10,541 WARN plugin.PluginRepository - Error while > >> loading > >>> plugin `/nutch/plugins/parse-replace/plugin.xml` > >>> java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml > >> (No > >>> such file or directory) > >>> 2018-07-20 09:46:10,567 INFO crawl.FetchScheduleFactory - Using > >>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule > >>> 2018-07-20 09:46:10,567 INFO crawl.AbstractFetchSchedule - > >>> defaultInterval=2592000 > >>> 2018-07-20 09:46:10,567 INFO crawl.AbstractFetchSchedule - > >>> maxInterval=7776000 > >>> 2018-07-20 09:46:10,616 INFO 
crawl.FetchScheduleFactory - Using > >>> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule > >>> 2018-07-20 09:46:10,616 INFO crawl.AbstractFetchSchedule - > >>> defaultInterval=2592000 > >>> 2018-07-20 09:46:10,616 INFO crawl.AbstractFetchSchedule - > >>> maxInterval=7776000 > >>> 2018-07-20 09:46:11,005 INFO crawl.CrawlDb - CrawlDb update: finished > at > >>> 2018-07-20 09:46:11, elapsed: 00:00:01 > >>> 2018-07-20 09:46:11,980 WARN util.NativeCodeLoader - Unable to load > >>> native-hadoop library for your platform... using builtin-java classes > >> where > >>> applicable > >>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: starting at > >> 2018-07-20 > >>> 09:46:12 > >>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: linkdb: > >>> TestCra7sl/linkdb > >>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: URL normalize: > true > >>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: URL filter: true > >>> 2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: internal links > will > >> be > >>> ignored. 
2018-07-20 09:46:12,132 INFO crawl.LinkDb - LinkDb: adding segment: TestCra7sl/segments/20180720094555
2018-07-20 09:46:12,922 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
2018-07-20 09:46:12,926 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2018-07-20 09:46:12,946 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
2018-07-20 09:46:13,726 INFO crawl.LinkDb - LinkDb: finished at 2018-07-20 09:46:13, elapsed: 00:00:01
2018-07-20 09:46:14,596 INFO crawl.DeduplicationJob - DeduplicationJob: starting at 2018-07-20 09:46:14
2018-07-20 09:46:14,785 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-20 09:46:16,448 INFO crawl.DeduplicationJob - Deduplication: 0 documents marked as duplicates
2018-07-20 09:46:16,449 INFO crawl.DeduplicationJob - Deduplication: Updating status of duplicate urls into crawl db.
2018-07-20 09:46:17,665 INFO crawl.DeduplicationJob - Deduplication finished at 2018-07-20 09:46:17, elapsed: 00:00:03
2018-07-20 09:46:18,637 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-20 09:46:18,776 INFO segment.SegmentChecker - Segment dir is complete: TestCra7sl/segments/20180720094555.
2018-07-20 09:46:18,777 INFO indexer.IndexingJob - Indexer: starting at 2018-07-20 09:46:18
2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: URL filtering: false
2018-07-20 09:46:18,780 INFO indexer.IndexingJob - Indexer: URL normalizing: false
2018-07-20 09:46:18,848 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
2018-07-20 09:46:18,853 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2018-07-20 09:46:18,877 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
2018-07-20 09:46:18,947 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-20 09:46:18,947 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication

2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCra7sl/crawldb
2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCra7sl/linkdb
2018-07-20 09:46:18,949 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCra7sl/segments/20180720094555
2018-07-20 09:46:19,781 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-07-20 09:46:20,716 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: content dest: content
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: title dest: title
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.description dest: description
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.section dest: section
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: host dest: host
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-20 09:46:20,809 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: content dest: content
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: title dest: title
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.description dest: description
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.section dest: section
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: host dest: host
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-20 09:46:21,636 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
2018-07-20 09:46:21,646 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
2018-07-20 09:46:21,652 INFO indexer.IndexingJob - Indexer: finished at 2018-07-20 09:46:21, elapsed: 00:00:02
2018-07-20 09:46:22,551 INFO indexer.CleaningJob - CleaningJob: starting at 2018-07-20 09:46:22
2018-07-20 09:46:22,717 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-20 09:46:23,395 WARN output.FileOutputCommitter - Output Path is null in setupJob()
2018-07-20 09:46:23,644 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
2018-07-20 09:46:23,649 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2018-07-20 09:46:23,669 WARN plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
2018-07-20 09:46:23,702 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: content dest: content
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: title dest: title
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.description dest: description
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.section dest: section
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: metatag.gldocname dest: gldocname
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: host dest: host
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-20 09:46:23,766 INFO solr.SolrMappingReader - source: mainContent dest: mainContent
2018-07-20 09:46:23,769 INFO solr.SolrIndexWriter - SolrIndexer: deleting 1/1 documents
2018-07-20 09:46:23,909 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
2018-07-20 09:46:23,910 WARN mapred.LocalJobRunner - job_local1584437722_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:481)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:122)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:244)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-07-20 09:46:24,406 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
    at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:174)
    at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:197)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:208)

On Fri, Jul 20, 2018 at 2:06 AM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

Hi,

> * Changed my regex-filter to use development domain address.

Did you also change your seeds?

The fact that deletions are sent but not additions/updates suggests that no pages have been successfully crawled.

Could you specify the Nutch version used and also attach some log snippets to make it possible to analyze the issue?

Thanks,
Sebastian

On 07/19/2018 10:30 PM, Rushi wrote:

Hi all,
I have been using Nutch for the last 6 months and it works with production URLs without any issue. For testing purposes I want to make this work on Dev/staging. I followed these steps

And ran this command:

./bin/crawl -i -D solr.server.url=http://dev.abc.com:8983/solr/automateindex/ urls/ Testcrawl 2

I don't see that it is indexed, but it shows that it is deleted.

[screenshot attachment: nutch_—_-bash_—_206×67.png]

Note: I tried checking the Dev URL with bin/nutch indexchecker https://dev-abc.com/letters and it shows me the content.

I would really appreciate a suggested solution.

--
Regards
Rushikesh M
.Net Developer
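[Editor's note] For readers hitting the same symptom (an internal/dev site where only one fetch is attempted and only a deletion reaches Solr), the fix reported at the top of this thread was to exempt the dev host from robots.txt rules via the `http.robot.rules.whitelist` property. A minimal sketch of what that entry might look like — the file `conf/nutch-site.xml` and the single-host value are illustrative, and the host shown is this thread's example domain:

```xml
<!-- Sketch only: exempt hosts you control from robots.txt rules parsing. -->
<property>
  <name>http.robot.rules.whitelist</name>
  <!-- Comma-separated list of hostnames or IP addresses. -->
  <value>dev-abc.com</value>
</property>
```

Use this only for sites you operate; whitelisting hosts you do not control ignores their crawl policies.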