Hi,

it looks more like there is nothing to index.

Unfortunately, in 2.x no log messages are enabled
by default that indicate how many documents are
sent to the indexing back-ends.

The easiest way is to enable the job counters in
conf/log4j.properties, either by adding the line:

 log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

or by raising the level from WARN to INFO on the
existing line:

 log4j.logger.org.apache.hadoop=WARN
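For reference, the relevant fragment of conf/log4j.properties could
then look roughly like this (a sketch only; merge it into whatever
your file already contains):

```properties
# Emit MapReduce job counters (including DocumentCount) into hadoop.log
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO

# Alternatively, raise the existing catch-all Hadoop logger to INFO:
# log4j.logger.org.apache.hadoop=INFO
```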

Make sure the log4j.properties is correctly deployed
(if in doubt, run "ant runtime"). Then check hadoop.log
again: there should be a counter DocumentCount with a
non-zero value.
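A quick way to check is to grep the log for the counter. A minimal,
self-contained sketch (the sample log line is made up, and the real
hadoop.log path assumes a local runtime under runtime/local/logs):

```shell
# Fabricated sample line, so the command can be tried stand-alone:
printf 'mapreduce.Job -   DocumentCount=15\n' > /tmp/hadoop.log.sample
grep 'DocumentCount' /tmp/hadoop.log.sample

# Against a real deployment you would run something like:
#   grep 'DocumentCount' runtime/local/logs/hadoop.log
```

If the counter is zero or absent, the job had no documents to send,
which points back at the crawl/parse steps rather than at the indexer.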

Best,
Sebastian


On 03/02/2018 06:50 AM, Yash Thenuan Thenuan wrote:
> Following are the logs from hadoop.log
> 
> 2018-03-02 11:18:45,220 INFO  indexer.IndexingJob - IndexingJob: starting
> 2018-03-02 11:18:45,791 WARN  util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2018-03-02 11:18:46,138 INFO  basic.BasicIndexingFilter - Maximum title
> length for indexing set to: -1
> 2018-03-02 11:18:46,138 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 11:18:46,140 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-03-02 11:18:46,140 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 11:18:46,157 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 11:18:46,535 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 11:18:48,663 WARN  conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2018-03-02 11:18:48,666 WARN  conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/staging/yasht1100834069/.staging/job_local1100834069_0001/job.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.attempts;  Ignoring.
> 2018-03-02 11:18:48,792 WARN  conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.retry.interval;  Ignoring.
> 2018-03-02 11:18:48,798 WARN  conf.Configuration -
> file:/tmp/hadoop-yasht/mapred/local/localRunner/yasht/job_local1100834069_0001/job_local1100834069_0001.xml:an
> attempt to override final parameter:
> mapreduce.job.end-notification.max.attempts;  Ignoring.
> 2018-03-02 11:18:49,093 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 11:18:54,737 INFO  basic.BasicIndexingFilter - Maximum title
> length for indexing set to: -1
> 2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.basic.BasicIndexingFilter
> 2018-03-02 11:18:54,737 INFO  anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.anchor.AnchorIndexingFilter
> 2018-03-02 11:18:54,737 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.metadata.MetadataIndexer
> 2018-03-02 11:18:54,738 INFO  indexer.IndexingFilters - Adding
> org.apache.nutch.indexer.more.MoreIndexingFilter
> 2018-03-02 11:18:56,883 INFO  indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
> 2018-03-02 11:18:56,884 INFO  indexer.IndexingJob - Active IndexWriters :
> ElasticIndexWriter
> elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port  (default 9200)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
> elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)
> 
> 
> 2018-03-02 11:18:56,939 INFO  indexer.IndexingJob - IndexingJob: done.
> 
> 
> On Thu, Mar 1, 2018 at 10:11 PM, Sebastian Nagel <wastl.na...@googlemail.com
>> wrote:
> 
>> It's impossible to find the reason from console output.
>> Please check the hadoop.log, it should contain more logs
>> including those from ElasticIndexWriter.
>>
>> Sebastian
>>
>> On 03/01/2018 06:38 AM, Yash Thenuan Thenuan wrote:
>>> Hi Sebastian, all of this is coming but the problem is, the content is
>>> not sent. Nothing is indexed to ES.
>>> This is the output on debug level.
>>>
>>> ElasticIndexWriter
>>>
>>> elastic.cluster : elastic prefix cluster
>>>
>>> elastic.host : hostname
>>>
>>> elastic.port : port  (default 9200)
>>>
>>> elastic.index : elastic index command
>>>
>>> elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
>>>
>>> elastic.max.bulk.size : elastic bulk index length. (default 2500500
>> ~2.5MB)
>>>
>>>
>>> no modules loaded
>>>
>>> loaded plugin [org.elasticsearch.index.reindex.ReindexPlugin]
>>>
>>> loaded plugin [org.elasticsearch.join.ParentJoinPlugin]
>>>
>>> loaded plugin [org.elasticsearch.percolator.PercolatorPlugin]
>>>
>>> loaded plugin [org.elasticsearch.script.mustache.MustachePlugin]
>>>
>>> loaded plugin [org.elasticsearch.transport.Netty4Plugin]
>>>
>>> created thread pool: name [force_merge], size [1], queue size [unbounded]
>>>
>>> created thread pool: name [fetch_shard_started], core [1], max [8], keep
>>> alive [5m]
>>>
>>> created thread pool: name [listener], size [2], queue size [unbounded]
>>>
>>> created thread pool: name [index], size [4], queue size [200]
>>>
>>> created thread pool: name [refresh], core [1], max [2], keep alive [5m]
>>>
>>> created thread pool: name [generic], core [4], max [128], keep alive
>> [30s]
>>>
>>> created thread pool: name [warmer], core [1], max [2], keep alive [5m]
>>>
>>> thread pool [search] will adjust queue by [50] when determining automatic
>>> queue size
>>>
>>> created thread pool: name [search], size [7], queue size [1k]
>>>
>>> created thread pool: name [flush], core [1], max [2], keep alive [5m]
>>>
>>> created thread pool: name [fetch_shard_store], core [1], max [8], keep
>>> alive [5m]
>>>
>>> created thread pool: name [management], core [1], max [5], keep alive
>> [5m]
>>>
>>> created thread pool: name [get], size [4], queue size [1k]
>>>
>>> created thread pool: name [bulk], size [4], queue size [200]
>>>
>>> created thread pool: name [snapshot], core [1], max [2], keep alive [5m]
>>>
>>> node_sampler_interval[5s]
>>>
>>> adding address [{#transport#-1}{nNtPR9OJShWSW-ayXRDILA}{localhost}{
>>> 127.0.0.1:9300}]
>>>
>>> connected to node
>>> [{tzfqJn0}{tzfqJn0sS5OPV4lKreU60w}{QCGd9doAQaGw4Q_lOqniLQ}{127.0.0.1}{
>>> 127.0.0.1:9300}]
>>>
>>> IndexingJob: done
>>>
>>>
>>> On Wed, Feb 28, 2018 at 10:05 PM, Sebastian Nagel <
>>> wastl.na...@googlemail.com> wrote:
>>>
>>>> I never tried ES with Nutch 2.3 but it should be similar to setup as for
>>>> 1.x:
>>>>
>>>> - enable the plugin "indexer-elastic" in plugin.includes
>>>>   (upgrade and rename to "indexer-elastic2" in 2.4)
>>>>
>>>> - expects ES 1.4.1
>>>>
>>>> - available/required options are found in the log file (hadoop.log):
>>>>    ElasticIndexWriter
>>>>         elastic.cluster : elastic prefix cluster
>>>>         elastic.host : hostname
>>>>         elastic.port : port  (default 9300)
>>>>         elastic.index : elastic index command
>>>>         elastic.max.bulk.docs : elastic bulk index doc counts. (default
>>>> 250)
>>>>         elastic.max.bulk.size : elastic bulk index length. (default
>>>> 2500500 ~2.5MB)
>>>>
>>>> Sebastian
>>>>
>>>> On 02/28/2018 01:26 PM, Yash Thenuan Thenuan wrote:
>>>>> Yeah
>>>>> I was also thinking that
>>>>> Can somebody help me with nutch 2.3?
>>>>>
>>>>> On 28 Feb 2018 17:53, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>
>>>>>> Sorry, I just realized that you're using Nutch 2.x and I'm answering
>> for
>>>>>> Nutch 1.x. I'm afraid I can't help you.
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>> Sent: 28 February 2018 14:20
>>>>>>> To: user@nutch.apache.org
>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>
>>>>>>> IndexingJob (<batchId> | -all |-reindex) [-crawlId <id>] This is the
>>>>>> output of
>>>>>>> nutch index i have already configured the nutch-site.xml.
>>>>>>>
>>>>>>> On 28 Feb 2018 17:41, "Yossi Tamari" <yossi.tam...@pipl.com> wrote:
>>>>>>>
>>>>>>>> I suggest you run "nutch index", take a look at the returned help
>>>>>>>> message, and continue from there.
>>>>>>>> Broadly, first of all you need to configure your elasticsearch
>>>>>>>> environment in nutch-site.xml, and then you need to run nutch index
>>>>>>>> with the location of your CrawlDB and either the segment you want to
>>>>>>>> index or the directory that contains all the segments you want to
>>>>>> index.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>> Sent: 28 February 2018 14:06
>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>> Subject: RE: Regarding Indexing to elasticsearch
>>>>>>>>>
>>>>>>>>> All I want  is to index my parsed data to elasticsearch.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 28 Feb 2018 17:34, "Yossi Tamari" <yossi.tam...@pipl.com>
>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Yash,
>>>>>>>>>
>>>>>>>>> The nutch index command does not have a -all flag, so I'm not sure
>>>>>>>>> what
>>>>>>>> you're
>>>>>>>>> trying to achieve here.
>>>>>>>>>
>>>>>>>>>         Yossi.
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Yash Thenuan Thenuan [mailto:rit2014...@iiita.ac.in]
>>>>>>>>>> Sent: 28 February 2018 13:55
>>>>>>>>>> To: user@nutch.apache.org
>>>>>>>>>> Subject: Regarding Indexing to elasticsearch
>>>>>>>>>>
>>>>>>>>>> Can somebody please tell me what happens when we hit the bin/nutch
>>>>>>>>>> index
>>>>>>>>> -all
>>>>>>>>>> command.
>>>>>>>>>> Because I can't figure out why the write function inside the
>>>>>>>>> elastic-indexer is not
>>>>>>>>>> getting executed.
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
> 