I got the following after setting log4j.logger.org.apache.hadoop to INFO:

2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-02 17:29:40,853 INFO Configuration.deprecation - mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens for job: job_local1792747860_0001
2018-03-02 17:29:43,566 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,570 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,726 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,731 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job: job_local1792747860_0001
2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set in config null
2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is org.apache.nutch.indexer.IndexerOutputFormat$2
2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task: attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree - ProcfsBasedProcessTree currently is supported only on Linux.
2018-03-02 17:29:43,899 INFO mapred.Task - Using ResourceCalculatorProcessTree : null
2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split: org.apache.gora.mapreduce.GoraInputSplit@424b7f03
2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001 running in uber mode : false
2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
2018-03-02 17:29:52,782 INFO mapred.Task - Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process of committing
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
2018-03-02 17:29:52,825 INFO mapred.Task - Task 'attempt_local1792747860_0001_m_000000_0' done.
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task: attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor complete.
2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001 completed successfully
2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
        File System Counters
                FILE: Number of bytes read=610359
                FILE: Number of bytes written=891634
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=79
                Map output records=0
                Input split bytes=995
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=103
                Total committed heap usage (bytes)=225443840
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9200)
        elastic.index : elastic index command
        elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.

On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi,
>
> looks more like that there is nothing to index.
>
> Unfortunately, in 2.x there are no log messages
> on by default which indicate how many documents
> are sent to the index back-ends.
>
> The easiest way is to enable Job counters in
> conf/log4j.properties by adding the line:
>
>   log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> or setting the level to INFO for
>
>   log4j.logger.org.apache.hadoop=WARN
>
> Make sure the log4j.properties is correctly deployed
> (in doubt, run "ant runtime"). Then check the hadoop.log
> again: there should be a counter DocumentCount with a non-zero value.
>
> Best,
> Sebastian
>
> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> > Following are the logs from hadoop.log:
> >
> > 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
> > 2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> > 2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> > 2018-03-02 11:18:48,663 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> > 2018-03-02 11:18:48,666 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> > 2018-03-02 11:18:48,792 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> > 2018-03-02 11:18:48,798 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> > 2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> > 2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> > 2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> > 2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
> > ElasticIndexWriter
> >         elastic.cluster : elastic prefix cluster
> >         elastic.host : hostname
> >         elastic.port : port (default 9200)
> >         elastic.index : elastic index command
> >         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >
> > 2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
> >
> > On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> It's impossible to find the reason from console output.
> >> Please check the hadoop.log, it should contain more logs
> >> including those from ElasticIndexWriter.
> >>
> >> Sebastian
> >>
> >> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
> >>> Hi Sebastian, all of this is coming, but the problem is that the content is not sent. Nothing is indexed to ES.
> >>> This is the output on debug level:
> >>>
> >>> ElasticIndexWriter
> >>>         elastic.cluster : elastic prefix cluster
> >>>         elastic.host : hostname
> >>>         elastic.port : port (default 9200)
> >>>         elastic.index : elastic index command
> >>>         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >>>         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >>>
> >>> no modules loaded
> >>> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
> >>> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
> >>> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
> >>> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
> >>> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
> >>> created thread pool: name [force_merge], size [1], queue size [unbounded]
> >>> created thread pool: name [fetch_shard_started], core [1], max [8], keep alive [5m]
> >>> created thread pool: name [listener], size [2], queue size [unbounded]
> >>> created thread pool: name [index], size [4], queue size [200]
> >>> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
> >>> created thread pool: name [generic], core [4], max [128], keep alive [30s]
> >>> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
> >>> thread pool [search] will adjust queue by [50] when determining automatic queue size
> >>> created thread pool: name [search], size [7], queue size [1k]
> >>> created thread pool: name [flush], core [1], max [2], keep alive [5m]
> >>> created thread pool: name [fetch_shard_store], core [1], max [8], keep alive [5m]
> >>> created thread pool: name [management], core [1], max [5], keep alive [5m]
> >>> created thread pool: name [get], size [4], queue size [1k]
> >>> created thread pool: name [bulk], size [4], queue size [200]
> >>> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
> >>> node_sampler_interval[5s]
> >>> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{127.0.0.1:9300}]
> >>> connected to node [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{127.0.0.1:9300}]
> >>> IndexingJob: done
> >>>
> >>> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>
> >>>> I never tried ES with Nutch 2.3, but it should be similar to set up as for 1.x:
> >>>>
> >>>> - enable the plugin "indexer-elastic" in plugin.includes
> >>>>   (upgraded and renamed to "indexer-elastic2" in 2.4)
> >>>> - expects ES 1.4.1
> >>>> - available/required options are found in the log file (hadoop.log):
> >>>>     ElasticIndexWriter
> >>>>         elastic.cluster : elastic prefix cluster
> >>>>         elastic.host : hostname
> >>>>         elastic.port : port (default 9300)
> >>>>         elastic.index : elastic index command
> >>>>         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >>>>         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >>>>
> >>>> Sebastian
> >>>>
> >>>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> >>>>> Yeah, I was also thinking that.
> >>>>> Can somebody help me with Nutch 2.3?
> >>>>>
> >>>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>
> >>>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> >>>>>> Nutch 1.x. I'm afraid I can't help you.
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>> Sent: 28 February 2018 14:20
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>>>
> >>>>>>> IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> >>>>>>> This is the output of nutch index. I have already configured the nutch-site.xml.
> >>>>>>>
> >>>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>>>
> >>>>>>>> I suggest you run "nutch index", take a look at the returned help
> >>>>>>>> message, and continue from there.
> >>>>>>>> Broadly, first of all you need to configure your elasticsearch
> >>>>>>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>>>>>> with the location of your CrawlDB and either the segment you want to
> >>>>>>>> index or the directory that contains all the segments you want to index.
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>>>> Sent: 28 February 2018 14:06
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>>>>>
> >>>>>>>>> All I want is to index my parsed data to elasticsearch.
> >>>>>>>>>
> >>>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Yash,
> >>>>>>>>>
> >>>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
> >>>>>>>>> what you're trying to achieve here.
> >>>>>>>>>
> >>>>>>>>> Yossi.
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>>>>> Sent: 28 February 2018 13:55
> >>>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>>> Subject: Regarding Indexing to elasticsearch
> >>>>>>>>>>
> >>>>>>>>>> Can somebody please tell me what happens when we hit the
> >>>>>>>>>> bin/nutch index -all command.
> >>>>>>>>>> Because I can't figure out why the write function inside the
> >>>>>>>>>> elastic-indexer is not getting executed.
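[Editor's note] Sebastian's fix at the top of the thread amounts to a one-line addition to conf/log4j.properties. A minimal sketch (the appender definitions and other loggers stay as shipped with Nutch):

```
# Log Hadoop job counters (including DocumentCount) to hadoop.log
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
```

After editing, redeploy the configuration ("ant runtime", as the thread says), re-run the indexing job, and grep hadoop.log for DocumentCount: a zero or missing counter means nothing was handed to the index writer, which matches the "Map output records=0" in the counters above.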
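[Editor's note] The elastic.* options printed by ElasticIndexWriter in the logs above correspond to properties in conf/nutch-site.xml. A minimal sketch, assuming a local ES node and an index named "nutch" -- the host, cluster, and index values here are placeholders, not values from the thread, and the plugin.includes value must be merged with whatever plugins your crawl already uses:

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- example value: keep your existing plugin list and add indexer-elastic -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata|more)|indexer-elastic</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <!-- transport port; the plugin's option list gives 9300 as the default in 1.x -->
    <value>9300</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
```

Note that elastic.cluster must match the cluster.name of the target ES node, and that the 2.x indexer-elastic plugin expects an old ES transport client (ES 1.4.1 per Sebastian's message), so a modern ES server will refuse the connection.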