Hi,

> Map input records=79
> Map output records=0
> ... and no IndexerJob:DocumentCount counter

The map function got 79 records as input but did not write anything to the indexer. There are several reasons why a document may be skipped, e.g. nothing parsed, missing markers, errors in indexing filters, etc. Have a look at the map method:

  https://github.com/apache/nutch/blob/branch-2.3.1/src/java/org/apache/nutch/indexer/IndexingJob.java#L95

and start debugging it. Alternatively, check your table and the log files of the previous steps. There must be a reason why nothing is indexed.

Best,
Sebastian

On 03/02/2018 01:03 PM, Yash Thenuan Thenuan wrote:
> I got this after setting log4j.logger.org.apache.hadoop to INFO
>
> 2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
> 2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-03-02 17:29:40,853 INFO Configuration.deprecation - mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
> 2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> 2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
> 2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
> 2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
> 2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens for job: job_local1792747860_0001
> 2018-03-02 17:29:43,566 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,570 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,726 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,731 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
> 2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job: job_local1792747860_0001
> 2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set in config null
> 2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is org.apache.nutch.indexer.IndexerOutputFormat$2
> 2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
> 2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task: attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree - ProcfsBasedProcessTree currently is supported only on Linux.
> 2018-03-02 17:29:43,899 INFO mapred.Task - Using ResourceCalculatorProcessTree : null
> 2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split: org.apache.gora.mapreduce.GoraInputSplit@424b7f03
> 2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001 running in uber mode : false
> 2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
> 2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
> 2018-03-02 17:29:52,782 INFO mapred.Task - Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process of committing
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
> 2018-03-02 17:29:52,825 INFO mapred.Task - Task 'attempt_local1792747860_0001_m_000000_0' done.
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task: attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor complete.
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001 completed successfully
> 2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
>   File System Counters
>     FILE: Number of bytes read=610359
>     FILE: Number of bytes written=891634
>     FILE: Number of read operations=0
>     FILE: Number of large read operations=0
>     FILE: Number of write operations=0
>   Map-Reduce Framework
>     Map input records=79
>     Map output records=0
>     Input split bytes=995
>     Spilled Records=0
>     Failed Shuffles=0
>     Merged Map outputs=0
>     GC time elapsed (ms)=103
>     Total committed heap usage (bytes)=225443840
>   File Input Format Counters
>     Bytes Read=0
>   File Output Format Counters
>     Bytes Written=0
> 2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>   elastic.cluster : elastic prefix cluster
>   elastic.host : hostname
>   elastic.port : port (default 9200)
>   elastic.index : elastic index command
>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
> 2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.
>
> On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>
>> Hi,
>>
>> It looks more like there is nothing to index.
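For reference, the skip paths in the 2.x indexing map function look roughly like this (a from-memory sketch of the branch-2.3.1 IndexerMapper, not verbatim source; names and order are approximate):

```
map(key, page):
    status = page.getParseStatus()
    if status is null, not a success, or a redirect:
        return                    # page never parsed, or parse failed -> skipped
    if a batchId was given and page lacks the matching updatedb mark:
        return                    # page not part of this batch -> skipped
    doc = indexUtil.index(key, page)
    if doc is null:
        return                    # an indexing filter rejected the page
    emit(key, doc)                # counted as IndexerJob:DocumentCount
```

Any of the early returns, hit 79 times, produces exactly the observed "Map input records=79 / Map output records=0" pattern.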
>>
>> Unfortunately, in 2.x there are no log messages
>> on by default which indicate how many documents
>> are sent to the index back-ends.
>>
>> The easiest way is to enable Job counters in
>> conf/log4j.properties by adding the line:
>>
>>   log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>>
>> or setting the level to INFO for
>>
>>   log4j.logger.org.apache.hadoop=WARN
>>
>> Make sure the log4j.properties is correctly deployed
>> (if in doubt, run "ant runtime"). Then check the hadoop.log
>> again: there should be a counter DocumentCount with a
>> non-zero value.
>>
>> Best,
>> Sebastian
>>
>> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
>>> Following are the logs from hadoop.log
>>>
>>> 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
>>> 2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
>>> 2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:48,663 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,666 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:48,792 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,798 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
>>> ElasticIndexWriter
>>>   elastic.cluster : elastic prefix cluster
>>>   elastic.host : hostname
>>>   elastic.port : port (default 9200)
>>>   elastic.index : elastic index command
>>>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>
>>> 2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
>>>
>>> On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>>>
>>>> It's impossible to find the reason from console output.
>>>> Please check the hadoop.log, it should contain more logs
>>>> including those from ElasticIndexWriter.
>>>>
>>>> Sebastian
>>>>
>>>> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
>>>>> Hi Sebastian, all of this is coming, but the problem is: the content is not sent. Nothing is indexed to ES.
>>>>> This is the output on debug level.
>>>>>
>>>>> ElasticIndexWriter
>>>>>   elastic.cluster : elastic prefix cluster
>>>>>   elastic.host : hostname
>>>>>   elastic.port : port (default 9200)
>>>>>   elastic.index : elastic index command
>>>>>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>>>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>>>
>>>>> no modules loaded
>>>>> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
>>>>> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
>>>>> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
>>>>> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
>>>>> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
>>>>> created thread pool: name [force_merge], size [1], queue size [unbounded]
>>>>> created thread pool: name [fetch_shard_started], core [1], max [8], keep alive [5m]
>>>>> created thread pool: name [listener], size [2], queue size [unbounded]
>>>>> created thread pool: name [index], size [4], queue size [200]
>>>>> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
>>>>> created thread pool: name [generic], core [4], max [128], keep alive [30s]
>>>>> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
>>>>> thread pool [search] will adjust queue by [50] when determining automatic queue size
>>>>> created thread pool: name [search], size [7], queue size [1k]
>>>>> created thread pool: name [flush], core [1], max [2], keep alive [5m]
>>>>> created thread pool: name [fetch_shard_store], core [1], max [8], keep alive [5m]
>>>>> created thread pool: name [management], core [1], max [5], keep alive [5m]
>>>>> created thread pool: name [get], size [4], queue size [1k]
>>>>> created thread pool: name [bulk], size [4], queue size [200]
>>>>> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
>>>>> node_sampler_interval[5s]
>>>>> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{127.0.0.1:9300}]
>>>>> connected to node [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{127.0.0.1:9300}]
>>>>>
>>>>> IndexingJob: done
>>>>>
>>>>> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>>>>>
>>>>>> I never tried ES with Nutch 2.3 but it should be similar to set up as for 1.x:
>>>>>>
>>>>>> - enable the plugin "indexer-elastic" in plugin.includes
>>>>>>   (upgraded and renamed to "indexer-elastic2" in 2.4)
>>>>>> - expects ES 1.4.1
>>>>>> - available/required options are found in the log file (hadoop.log):
>>>>>>     ElasticIndexWriter
>>>>>>       elastic.cluster : elastic prefix cluster
>>>>>>       elastic.host : hostname
>>>>>>       elastic.port : port (default 9300)
>>>>>>       elastic.index : elastic index command
>>>>>>       elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>>>>       elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>>>>
>>>>>> Sebastian
>>>>>>
>>>>>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
>>>>>>> Yeah, I was also thinking that.
>>>>>>> Can somebody help me with Nutch 2.3?
>>>>>>>
>>>>>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>
>>>>>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering for Nutch 1.x. I'm afraid I can't help you.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>> Sent: 28 February 2018 14:20
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>>>
>>>>>>>>> "IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]" is the output of nutch index. I have already configured the nutch-site.xml.
>>>>>>>>>
>>>>>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>>>
>>>>>>>>>> I suggest you run "nutch index", take a look at the returned help
>>>>>>>>>> message, and continue from there.
>>>>>>>>>> Broadly, first of all you need to configure your elasticsearch
>>>>>>>>>> environment in nutch-site.xml, and then you need to run nutch index
>>>>>>>>>> with the location of your CrawlDB and either the segment you want to
>>>>>>>>>> index or the directory that contains all the segments you want to
>>>>>>>>>> index.
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>>>> Sent: 28 February 2018 14:06
>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>>>>>
>>>>>>>>>>> All I want is to index my parsed data to elasticsearch.
>>>>>>>>>>>
>>>>>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Yash,
>>>>>>>>>>>
>>>>>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
>>>>>>>>>>> what you're trying to achieve here.
>>>>>>>>>>>
>>>>>>>>>>> Yossi.
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>>>>> Sent: 28 February 2018 13:55
>>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>>> Subject: Regarding Indexing to elasticsearch
>>>>>>>>>>>>
>>>>>>>>>>>> Can somebody please tell me what happens when we hit the
>>>>>>>>>>>> bin/nutch index -all command.
>>>>>>>>>>>> Because I can't figure out why the write function inside the
>>>>>>>>>>>> elastic-indexer is not getting executed.
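Putting Sebastian's log4j suggestion in one place, the change to conf/log4j.properties is a one-liner (the comment is mine, not from a stock Nutch config; redeploy with "ant runtime" afterwards):

```properties
# Log Hadoop job counters (including IndexerJob:DocumentCount) to hadoop.log
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
```

With this in place, a zero or missing DocumentCount after an index run confirms the documents are being skipped in the map phase rather than lost in the Elasticsearch writer.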