Hi,

> Map input records=79
> Map output records=0
> ... and no IndexerJob:DocumentCount counter

The map function got 79 records as input but did not write anything to the indexer. There are several reasons why a document may be skipped, e.g. nothing parsed, missing markers, errors in indexing filters, etc. Have a look at the map method:

  https://github.com/apache/nutch/blob/branch-2.3.1/src/java/org/apache/nutch/indexer/IndexingJob.java#L95

and start debugging it. Alternatively, check your table and the log files of the previous steps. There must be a reason why nothing is indexed.

Best,
Sebastian

On 03/02/2018 01:03 PM, Yash Thenuan Thenuan wrote:
> I got this after setting log4j.logger.org.apache.hadoop to INFO
>
> 2018-03-02 17:29:40,157 INFO indexer.IndexingJob - IndexingJob: starting
> 2018-03-02 17:29:40,775 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2018-03-02 17:29:40,853 INFO Configuration.deprecation - mapred.output.key.comparator.class is deprecated. Instead, use mapreduce.job.output.key.comparator.class
> 2018-03-02 17:29:41,073 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> 2018-03-02 17:29:41,073 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:41,076 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-02 17:29:41,076 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:41,094 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:41,465 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:42,585 INFO Configuration.deprecation - session.id is deprecated. Instead, use dfs.metrics.session-id
> 2018-03-02 17:29:42,587 INFO jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
> 2018-03-02 17:29:43,277 INFO mapreduce.JobSubmitter - number of splits:1
> 2018-03-02 17:29:43,501 INFO mapreduce.JobSubmitter - Submitting tokens for job: job_local1792747860_0001
> 2018-03-02 17:29:43,566 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,570 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1792747860/.staging/job_local1792747860_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,726 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
> 2018-03-02 17:29:43,731 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1792747860_0001/job_local1792747860_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
> 2018-03-02 17:29:43,755 INFO mapreduce.Job - The url to track the job: http://localhost:8080/
> 2018-03-02 17:29:43,757 INFO mapreduce.Job - Running job: job_local1792747860_0001
> 2018-03-02 17:29:43,757 INFO mapred.LocalJobRunner - OutputCommitter set in config null
> 2018-03-02 17:29:43,767 INFO mapred.LocalJobRunner - OutputCommitter is org.apache.nutch.indexer.IndexerOutputFormat$2
> 2018-03-02 17:29:43,838 INFO mapred.LocalJobRunner - Waiting for map tasks
> 2018-03-02 17:29:43,841 INFO mapred.LocalJobRunner - Starting task: attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:43,899 INFO util.ProcfsBasedProcessTree - ProcfsBasedProcessTree currently is supported only on Linux.
> 2018-03-02 17:29:43,899 INFO mapred.Task - Using ResourceCalculatorProcessTree : null
> 2018-03-02 17:29:43,923 INFO mapred.MapTask - Processing split: org.apache.gora.mapreduce.GoraInputSplit@424b7f03
> 2018-03-02 17:29:44,051 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:44,767 INFO mapreduce.Job - Job job_local1792747860_0001 running in uber mode : false
> 2018-03-02 17:29:44,769 INFO mapreduce.Job - map 0% reduce 0%
> 2018-03-02 17:29:50,926 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 17:29:50,926 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 17:29:50,926 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 17:29:50,927 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 17:29:51,153 INFO mapred.LocalJobRunner -
> 2018-03-02 17:29:52,782 INFO mapred.Task - Task:attempt_local1792747860_0001_m_000000_0 is done. And is in the process of committing
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map
> 2018-03-02 17:29:52,825 INFO mapred.Task - Task 'attempt_local1792747860_0001_m_000000_0' done.
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - Finishing task: attempt_local1792747860_0001_m_000000_0
> 2018-03-02 17:29:52,825 INFO mapred.LocalJobRunner - map task executor complete.
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - map 100% reduce 0%
> 2018-03-02 17:29:53,791 INFO mapreduce.Job - Job job_local1792747860_0001 completed successfully
> 2018-03-02 17:29:53,849 INFO mapreduce.Job - Counters: 15
>   File System Counters
>     FILE: Number of bytes read=610359
>     FILE: Number of bytes written=891634
>     FILE: Number of read operations=0
>     FILE: Number of large read operations=0
>     FILE: Number of write operations=0
>   Map-Reduce Framework
>     Map input records=79
>     Map output records=0
>     Input split bytes=995
>     Spilled Records=0
>     Failed Shuffles=0
>     Merged Map outputs=0
>     GC time elapsed (ms)=103
>     Total committed heap usage (bytes)=225443840
>   File Input Format Counters
>     Bytes Read=0
>   File Output Format Counters
>     Bytes Written=0
> 2018-03-02 17:29:53,866 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 17:29:53,866 INFO indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
>   elastic.cluster : elastic prefix cluster
>   elastic.host : hostname
>   elastic.port : port (default 9200)
>   elastic.index : elastic index command
>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>
> 2018-03-02 17:29:53,925 INFO indexer.IndexingJob - IndexingJob: done.
>
> On Fri, Mar 2, 2018 at 3:08 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>
>> Hi,
>>
>> It looks more like there is nothing to index.
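For reference, the skip paths in the 2.x indexing map function look roughly like this (a from-memory sketch of the branch-2.3.1 IndexerMapper, not verbatim source; names and order are approximate):

```
map(key, page):
    status = page.getParseStatus()
    if status is null, not a success, or a redirect:
        return                    # page never parsed, or parse failed -> skipped
    if a batchId was given and page lacks the matching updatedb mark:
        return                    # page not part of this batch -> skipped
    doc = indexUtil.index(key, page)
    if doc is null:
        return                    # an indexing filter rejected the page
    emit(key, doc)                # counted as IndexerJob:DocumentCount
```

Any of the early returns, hit 79 times, produces exactly the observed "Map input records=79 / Map output records=0" pattern.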
>>
>> Unfortunately, in 2.x there are no log messages
>> on by default which indicate how many documents
>> are sent to the index back-ends.
>>
>> The easiest way is to enable Job counters in
>> conf/log4j.properties by adding the line:
>>
>>   log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
>>
>> or setting the level to INFO for
>>
>>   log4j.logger.org.apache.hadoop=WARN
>>
>> Make sure the log4j.properties is correctly deployed
>> (if in doubt, run "ant runtime"). Then check the hadoop.log
>> again: there should be a counter DocumentCount with a
>> non-zero value.
>>
>> Best,
>> Sebastian
>>
>> On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
>>> Following are the logs from hadoop.log
>>>
>>> 2018-03-02 11:18:45,220 INFO indexer.IndexingJob - IndexingJob: starting
>>> 2018-03-02 11:18:45,791 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> 2018-03-02 11:18:46,138 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
>>> 2018-03-02 11:18:46,138 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:46,140 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2018-03-02 11:18:46,140 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:46,157 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:46,535 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:48,663 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,666 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:48,792 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
>>> 2018-03-02 11:18:48,798 WARN conf.Configuration - file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
>>> 2018-03-02 11:18:49,093 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:54,737 INFO basic.BasicIndexingFilter - Maximum title length for indexing set to: -1
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.basic.BasicIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.anchor.AnchorIndexingFilter
>>> 2018-03-02 11:18:54,737 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.metadata.MetadataIndexer
>>> 2018-03-02 11:18:54,738 INFO indexer.IndexingFilters - Adding org.apache.nutch.indexer.more.MoreIndexingFilter
>>> 2018-03-02 11:18:56,883 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
>>> 2018-03-02 11:18:56,884 INFO indexer.IndexingJob - Active IndexWriters :
>>> ElasticIndexWriter
>>>   elastic.cluster : elastic prefix cluster
>>>   elastic.host : hostname
>>>   elastic.port : port (default 9200)
>>>   elastic.index : elastic index command
>>>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>
>>> 2018-03-02 11:18:56,939 INFO indexer.IndexingJob - IndexingJob: done.
>>>
>>> On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>>>
>>>> It's impossible to find the reason from console output.
>>>> Please check the hadoop.log, it should contain more logs
>>>> including those from ElasticIndexWriter.
>>>>
>>>> Sebastian
>>>>
>>>> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
>>>>> Hi Sebastian, all of this is coming, but the problem is: the content is not sent. Nothing is indexed to ES.
>>>>> This is the output on debug level.
>>>>>
>>>>> ElasticIndexWriter
>>>>>   elastic.cluster : elastic prefix cluster
>>>>>   elastic.host : hostname
>>>>>   elastic.port : port (default 9200)
>>>>>   elastic.index : elastic index command
>>>>>   elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>>>   elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>>>
>>>>> no modules loaded
>>>>> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
>>>>> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
>>>>> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
>>>>> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
>>>>> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
>>>>> created thread pool: name [force_merge], size [1], queue size [unbounded]
>>>>> created thread pool: name [fetch_shard_started], core [1], max [8], keep alive [5m]
>>>>> created thread pool: name [listener], size [2], queue size [unbounded]
>>>>> created thread pool: name [index], size [4], queue size [200]
>>>>> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
>>>>> created thread pool: name [generic], core [4], max [128], keep alive [30s]
>>>>> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
>>>>> thread pool [search] will adjust queue by [50] when determining automatic queue size
>>>>> created thread pool: name [search], size [7], queue size [1k]
>>>>> created thread pool: name [flush], core [1], max [2], keep alive [5m]
>>>>> created thread pool: name [fetch_shard_store], core [1], max [8], keep alive [5m]
>>>>> created thread pool: name [management], core [1], max [5], keep alive [5m]
>>>>> created thread pool: name [get], size [4], queue size [1k]
>>>>> created thread pool: name [bulk], size [4], queue size [200]
>>>>> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
>>>>> node_sampler_interval[5s]
>>>>> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{127.0.0.1:9300}]
>>>>> connected to node [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{127.0.0.1:9300}]
>>>>>
>>>>> IndexingJob: done
>>>>>
>>>>> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <wastl.na...@googlemail.com> wrote:
>>>>>
>>>>>> I never tried ES with Nutch 2.3 but it should be similar to set up as for 1.x:
>>>>>>
>>>>>> - enable the plugin "indexer-elastic" in plugin.includes
>>>>>>   (upgraded and renamed to "indexer-elastic2" in 2.4)
>>>>>> - expects ES 1.4.1
>>>>>> - available/required options are found in the log file (hadoop.log):
>>>>>>     ElasticIndexWriter
>>>>>>       elastic.cluster : elastic prefix cluster
>>>>>>       elastic.host : hostname
>>>>>>       elastic.port : port (default 9300)
>>>>>>       elastic.index : elastic index command
>>>>>>       elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>>>>       elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
>>>>>>
>>>>>> Sebastian
>>>>>>
>>>>>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
>>>>>>> Yeah, I was also thinking that.
>>>>>>> Can somebody help me with Nutch 2.3?
>>>>>>>
>>>>>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>
>>>>>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering for Nutch 1.x. I'm afraid I can't help you.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>> Sent: 28 February 2018 14:20
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>>>
>>>>>>>>> "IndexingJob (<batchId> | -all | -reindex) [-crawlId <id>]" is the output of nutch index. I have already configured the nutch-site.xml.
>>>>>>>>>
>>>>>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>>>
>>>>>>>>>> I suggest you run "nutch index", take a look at the returned help
>>>>>>>>>> message, and continue from there.
>>>>>>>>>> Broadly, first of all you need to configure your elasticsearch
>>>>>>>>>> environment in nutch-site.xml, and then you need to run nutch index
>>>>>>>>>> with the location of your CrawlDB and either the segment you want to
>>>>>>>>>> index or the directory that contains all the segments you want to
>>>>>>>>>> index.
>>>>>>>>>>
>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>>>> Sent: 28 February 2018 14:06
>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>>>>>
>>>>>>>>>>> All I want is to index my parsed data to elasticsearch.
>>>>>>>>>>>
>>>>>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Yash,
>>>>>>>>>>>
>>>>>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
>>>>>>>>>>> what you're trying to achieve here.
>>>>>>>>>>>
>>>>>>>>>>> Yossi.
>>>>>>>>>>>
>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>>>>> Sent: 28 February 2018 13:55
>>>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>>>> Subject: Regarding Indexing to elasticsearch
>>>>>>>>>>>>
>>>>>>>>>>>> Can somebody please tell me what happens when we hit the
>>>>>>>>>>>> bin/nutch index -all command.
>>>>>>>>>>>> Because I can't figure out why the write function inside the
>>>>>>>>>>>> elastic-indexer is not getting executed.
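Putting Sebastian's log4j suggestion in one place, the change to conf/log4j.properties is a one-liner (the comment is mine, not from a stock Nutch config; redeploy with "ant runtime" afterwards):

```properties
# Log Hadoop job counters (including IndexerJob:DocumentCount) to hadoop.log
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
```

With this in place, a zero or missing DocumentCount after an index run confirms the documents are being skipped in the map phase rather than lost in the Elasticsearch writer.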