I got the following after setting log4j.logger.org.apache.hadoop to INFO:

2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-03-02 17:29:40,853 INFO Configuration.deprecation - mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens for job: job_local1792747860_0001
2018-03-02 17:29:43,566 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,570 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,726 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2018-03-02 17:29:43,731 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job: job_local1792747860_0001
2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set in config null
2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is org.apache.nutch.indexer.IndexerOutputFormat$2
2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task: attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree - ProcfsBasedProcessTree currently is supported only on Linux.
2018-03-02 17:29:43,899 INFO mapred.Task - Using ResourceCalculatorProcessTree : null
2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split: org.apache.gora.mapreduce.GoraInputSplit@424b7f03
2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001 running in uber mode : false
2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
2018-03-02 17:29:52,782 INFO mapred.Task - Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process of committing
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
2018-03-02 17:29:52,825 INFO mapred.Task - Task 'attempt_local1792747860_0001_m_000000_0' done.
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task: attempt_local1792747860_0001_m_000000_0
2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor complete.
2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001 completed successfully
2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
        File System Counters
                FILE: Number of bytes read=610359
                FILE: Number of bytes written=891634
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
        Map-Reduce Framework
                Map input records=79
                Map output records=0
                Input split bytes=995
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=103
                Total committed heap usage (bytes)=225443840
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9200)
        elastic.index : elastic index command
        elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.

On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:

> Hi,
>
> looks more like that there is nothing to index.
>
> Unfortunately, in 2.x there are no log messages
> on by default which indicate how many documents
> are sent to the index back-ends.
>
> The easiest way is to enable Job counters in
> conf/log4j.properties by adding the line:
>
>   log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>
> or setting the level to INFO for
>
>   log4j.logger.org.apache.hadoop=WARN
>
> Make sure the log4j.properties is correctly deployed
> (in doubt, run "ant runtime"). Then check the hadoop.log
> again: there should be a counter DocumentCount with a non-zero value.
>
> Best,
> Sebastian
>
> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> > Following are the logs from hadoop.log:
> >
> > 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
> > 2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> > 2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> > 2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> > 2018-03-02 11:18:48,663 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> > 2018-03-02 11:18:48,666 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> > 2018-03-02 11:18:48,792 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> > 2018-03-02 11:18:48,798 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> > 2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> > 2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> > 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> > 2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> > 2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> > 2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
> > ElasticIndexWriter
> >         elastic.cluster : elastic prefix cluster
> >         elastic.host : hostname
> >         elastic.port : port (default 9200)
> >         elastic.index : elastic index command
> >         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >
> > 2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
> >
> > On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >
> >> It's impossible to find the reason from console output.
> >> Please check the hadoop.log, it should contain more logs
> >> including those from ElasticIndexWriter.
> >>
> >> Sebastian
> >>
> >> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
> >>> Hi Sebastian, all of this is coming, but the problem is that the content is not sent. Nothing is indexed to ES.
> >>> This is the output on debug level:
> >>>
> >>> ElasticIndexWriter
> >>>         elastic.cluster : elastic prefix cluster
> >>>         elastic.host : hostname
> >>>         elastic.port : port (default 9200)
> >>>         elastic.index : elastic index command
> >>>         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >>>         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >>>
> >>> no modules loaded
> >>> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
> >>> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
> >>> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
> >>> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
> >>> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
> >>> created thread pool: name [force_merge], size [1], queue size [unbounded]
> >>> created thread pool: name [fetch_shard_started], core [1], max [8], keep alive [5m]
> >>> created thread pool: name [listener], size [2], queue size [unbounded]
> >>> created thread pool: name [index], size [4], queue size [200]
> >>> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
> >>> created thread pool: name [generic], core [4], max [128], keep alive [30s]
> >>> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
> >>> thread pool [search] will adjust queue by [50] when determining automatic queue size
> >>> created thread pool: name [search], size [7], queue size [1k]
> >>> created thread pool: name [flush], core [1], max [2], keep alive [5m]
> >>> created thread pool: name [fetch_shard_store], core [1], max [8], keep alive [5m]
> >>> created thread pool: name [management], core [1], max [5], keep alive [5m]
> >>> created thread pool: name [get], size [4], queue size [1k]
> >>> created thread pool: name [bulk], size [4], queue size [200]
> >>> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
> >>> node_sampler_interval[5s]
> >>> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{127.0.0.1:9300}]
> >>> connected to node [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{127.0.0.1:9300}]
> >>> IndexingJob: done
> >>>
> >>> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
> >>>
> >>>> I never tried ES with Nutch 2.3, but it should be similar to set up as for 1.x:
> >>>>
> >>>> - enable the plugin "indexer-elastic" in plugin.includes
> >>>>   (upgraded and renamed to "indexer-elastic2" in 2.4)
> >>>> - expects ES 1.4.1
> >>>> - available/required options are found in the log file (hadoop.log):
> >>>>     ElasticIndexWriter
> >>>>         elastic.cluster : elastic prefix cluster
> >>>>         elastic.host : hostname
> >>>>         elastic.port : port (default 9300)
> >>>>         elastic.index : elastic index command
> >>>>         elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> >>>>         elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> >>>>
> >>>> Sebastian
> >>>>
> >>>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
> >>>>> Yeah, I was also thinking that.
> >>>>> Can somebody help me with Nutch 2.3?
> >>>>>
> >>>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>
> >>>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering for
> >>>>>> Nutch 1.x. I'm afraid I can't help you.
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>> Sent: 28 February 2018 14:20
> >>>>>>> To: user@nutch.apache.org
> >>>>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>>>
> >>>>>>> IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]
> >>>>>>> This is the output of nutch index. I have already configured the nutch-site.xml.
> >>>>>>>
> >>>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>>>
> >>>>>>>> I suggest you run "nutch index", take a look at the returned help
> >>>>>>>> message, and continue from there.
> >>>>>>>> Broadly, first of all you need to configure your elasticsearch
> >>>>>>>> environment in nutch-site.xml, and then you need to run nutch index
> >>>>>>>> with the location of your CrawlDB and either the segment you want to
> >>>>>>>> index or the directory that contains all the segments you want to index.
> >>>>>>>>
> >>>>>>>>> -----Original Message-----
> >>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>>>> Sent: 28 February 2018 14:06
> >>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
> >>>>>>>>>
> >>>>>>>>> All I want is to index my parsed data to elasticsearch.
> >>>>>>>>>
> >>>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Yash,
> >>>>>>>>>
> >>>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
> >>>>>>>>> what you're trying to achieve here.
> >>>>>>>>>
> >>>>>>>>> Yossi.
> >>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
> >>>>>>>>>> Sent: 28 February 2018 13:55
> >>>>>>>>>> To: user@nutch.apache.org
> >>>>>>>>>> Subject: Regarding Indexing to elasticsearch
> >>>>>>>>>>
> >>>>>>>>>> Can somebody please tell me what happens when we hit the
> >>>>>>>>>> bin/nutch index -all command.
> >>>>>>>>>> Because I can't figure out why the write function inside the
> >>>>>>>>>> elastic-indexer is not getting executed.
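[Editor's note] Sebastian's fix at the top of the thread amounts to a one-line addition to conf/log4j.properties. A minimal sketch (the appender definitions and other loggers stay as shipped with Nutch):

```
# Log Hadoop job counters (including DocumentCount) to hadoop.log
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
```

After editing, redeploy the configuration ("ant runtime", as the thread says), re-run the indexing job, and grep hadoop.log for DocumentCount: a zero or missing counter means nothing was handed to the index writer, which matches the "Map output records=0" in the counters above.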
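[Editor's note] The elastic.* options printed by ElasticIndexWriter in the logs above correspond to properties in conf/nutch-site.xml. A minimal sketch, assuming a local ES node and an index named "nutch" -- the host, cluster, and index values here are placeholders, not values from the thread, and the plugin.includes value must be merged with whatever plugins your crawl already uses:

```xml
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- example value: keep your existing plugin list and add indexer-elastic -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata|more)|indexer-elastic</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <name>elastic.port</name>
    <!-- transport port; the plugin's option list gives 9300 as the default in 1.x -->
    <value>9300</value>
  </property>
  <property>
    <name>elastic.cluster</name>
    <value>elasticsearch</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
```

Note that elastic.cluster must match the cluster.name of the target ES node, and that the 2.x indexer-elastic plugin expects an old ES transport client (ES 1.4.1 per Sebastian's message), so a modern ES server will refuse the connection.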