Hi,

> https://issues.apache.org/jira/browse/NUTCH-2357

thanks!

> There are some plugins that add fields to the doc to be indexed, but I
> can't find one place that describes every field Nutch sends to the index.

conf/schema.xml
 - should list all fields filled by the core code or any of the indexing filter plugins

Yes, the names are hard-coded: a change to the name of an index field must be
applied to both the schema and the Java code.

Ideally, the fields are also listed in the wiki
  https://wiki.apache.org/nutch/IndexStructure
(but currently, some plugins and their fields are not listed there)

> I think that before adding a field to a doc, Nutch should check whether that
> field is present in schema.xml or not.

Both IndexingFilters which add fields and indexer plugins (interface
IndexWriter) are plugins and should work independently.  A possible
improvement could be to add methods which let
- indexing filters announce the fields they fill, resp.
- index writers list the required or optionally accepted fields.
The indexing job could then check in advance for undeclared index fields.
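To make the idea concrete, here is a minimal sketch of such a pre-flight check. The getFilledFields()/getRequiredFields() methods are hypothetical (they do not exist in Nutch today), and plain Java collections stand in for the plugin machinery:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the proposed improvement (hypothetical API, not in Nutch):
// filters announce which fields they fill, writers declare which fields
// they need, and the indexing job cross-checks before the job runs.
public class FieldCheckSketch {

    interface IndexingFilter { Set<String> getFilledFields(); }
    interface IndexWriter   { Set<String> getRequiredFields(); }

    // Returns the fields some writer requires but no filter fills.
    static Set<String> undeclaredFields(List<IndexingFilter> filters,
                                        List<IndexWriter> writers) {
        Set<String> filled = new HashSet<>();
        for (IndexingFilter f : filters)
            filled.addAll(f.getFilledFields());
        Set<String> missing = new TreeSet<>();
        for (IndexWriter w : writers)
            for (String field : w.getRequiredFields())
                if (!filled.contains(field))
                    missing.add(field);
        return missing;
    }

    public static void main(String[] args) {
        IndexingFilter basic = () -> Set.of("url", "title", "host");
        IndexWriter solr    = () -> Set.of("url", "title", "responseTime");
        // "responseTime" is required but nothing fills it:
        System.out.println(undeclaredFields(List.of(basic), List.of(solr)));
        // → [responseTime]
    }
}
```

With such a check the job could fail fast with a clear message instead of a runtime exception from the index backend.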

Best,
Sebastian


On 02/07/2017 03:22 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian.
> I have opened a ticket for the problem in index-metadata.
> This is the URL:
> https://issues.apache.org/jira/browse/NUTCH-2357
> 
> 
> 
> About the responseTime in index-more: it looks great.
> A new field is also needed in the indexer (Solr, Elastic); if not, Nutch will
> throw an exception.
> 
> One more thing.
> There are some plugins that add fields to the doc to be indexed, but I
> can't find one place that describes every field Nutch sends to the index.
> index-basic sends (domain, host, url, content, title, cache, tstamp),
> index-more sends (type, date, contentLength),
> and others.
> I have looked into the Nutch code, but I think that Nutch doesn't use schema.xml.
> Is there any way to know all the fields that Nutch sends to the indexer (Solr
> or other)?
> I mean, apart from looking at the code of every index-* plugin, of course.
> If I delete one of these fields in Solr, then Nutch throws an exception,
> because every field name is hard-coded, for example:
> 
> doc.add("host", host);
> I think that before adding a field to a doc, Nutch should check whether that
> field is present in schema.xml or not.
> 
> ----- Original Message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 7 February 2017 4:28:03
> Subject: Re: [MASSMAIL]RE: make responseTime native in nutch
> 
> Hi,
> 
>> I have looked at the index-metadata plugin and I think the problem is when
>> the Writable object is forced to Text; see
>> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
> 
> Good catch! The metadata value must implement Writable but need not be an
> instance of Text. The actual class doesn't matter; the toString() method
> should return something meaningful.
> Please open a Jira issue to fix this problem in index-metadata.
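[For illustration, a minimal sketch of the type-agnostic lookup described above. A plain Java Map stands in for Hadoop's MapWritable (an assumption; the real fix operates on datum.getMetaData()): the value stays generic and only its toString() is used, so an IntWritable no longer triggers a ClassCastException.]

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: look up a metadata value without casting it to a concrete type.
// A plain Map<String, Object> stands in for Hadoop's MapWritable here.
public class MetadataLookupSketch {

    // Return the value's string form, or null if the key is absent.
    static String lookup(Map<String, Object> metadata, String key) {
        Object value = metadata.get(key);   // may be Text, IntWritable, ...
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> meta = new HashMap<>();
        meta.put("_rs_", 123);              // an integer value, not text
        System.out.println(lookup(meta, "_rs_"));   // prints 123, no cast
    }
}
```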
> 
> @Eyeris: it's up to you whether to open a separate issue to make the response
> time a "built-in" index field.  We could add it to index-more, which already
> provides fields related to the crawling: last-modified, MIME type, content
> length.
> 
> Best,
> Sebastian
> 
> On 02/07/2017 12:08 AM, Eyeris Rodriguez Rueda wrote:
>> Hello Markus.
>> I have tried your recommendation using
>> <property>
>>   <name>index.db.md</name>
>>   <value>_rs_</value>
>>   <description>
>>      Comma-separated list of keys to be taken from the crawldb metadata to
>> generate fields.
>>      Can be used to index values propagated from the seeds with the plugin
>> urlmeta.
>>   </description>
>> </property>
>>
>>
>> but I get the exception (see below) from the indexer.
>> ******************************************************************
>> 2017-02-06 18:18:28,905 INFO  anchor.AnchorIndexingFilter - Anchor 
>> deduplication is: on
>> 2017-02-06 18:18:29,024 INFO  more.MoreIndexingFilter - Reading content type 
>> mappings from file contenttype-mapping.txt
>> 2017-02-06 18:18:29,849 INFO  indexer.IndexWriters - Adding 
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: content dest: 
>> content
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: title dest: 
>> title
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: host dest: 
>> host
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: segment dest: 
>> segment
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: boost dest: 
>> boost
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: 
>> digest
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: 
>> tstamp
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: 
>> metatag.description dest: description
>> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: 
>> metatag.keywords dest: keywords
>> 2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local15168888_0001
>> java.lang.Exception: java.lang.ClassCastException: 
>> org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>>         at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>         at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
>> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable 
>> cannot be cast to org.apache.hadoop.io.Text
>>         at 
>> org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>>         at 
>> org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>>         at 
>> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>>         at 
>> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>>         at 
>> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>         at 
>> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>>         at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: 
>> java.io.IOException: Job failed!
>>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>         at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>         at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>         at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
>> ******************************************************************
>>
>> I have looked at the index-metadata plugin and I think the problem is when
>> the Writable object is forced to Text; see
>> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
>>
>> **********************************
>> // add the fields from crawldb
>>     if (dbFieldnames != null) {
>>       for (String metatag : dbFieldnames) {
>>         Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
>>         if (metadata != null)
>>           doc.add(metatag, metadata.toString());
>>       }
>>     }
>> ***************************************
>> Line 58 needs to be changed so that the value is not cast to Text:
>> Writable metadata = datum.getMetaData().get(new Text(metatag));
>> (the existing null check and metadata.toString() then work for any value type)
>>
>> If you agree i can do the jira ticket and patch for this.
>>
>> ----- Original Message -----
>> From: "Markus Jelsma" <[email protected]>
>> To: [email protected]
>> Sent: Monday, 6 February 2017 17:54:39
>> Subject: [MASSMAIL]RE: make responseTime native in nutch
>>
>> Try this:
>>
>> <property>
>>   <name>index.db.md</name>
>>   <value></value>
>>   <description>
>>      Comma-separated list of keys to be taken from the crawldb metadata to
>> generate fields.
>>      Can be used to index values propagated from the seeds with the plugin
>> urlmeta.
>>   </description>
>> </property>
>>
>> And enable the index-metadata plugin (iirc), and you are good to go!
>>
>> Cheers,
>> Markus
>>
>>  
>>  
>> -----Original message-----
>>> From:Eyeris Rodriguez Rueda <[email protected]>
>>> Sent: Monday 6th February 2017 15:56
>>> To: [email protected]
>>> Subject: make responseTime native in nutch
>>>
>>> Hi all.
>>> Nutch has a configuration option that permits saving the responseTime for
>>> every URL that is fetched; this value is stored in the CrawlDatum under the
>>> key _rs_ but not indexed.
>>> It would be very useful to index this value as well.
>>> This value is very important in all cases, and it is very easy to make it
>>> native in Nutch.
>>> A little change to the index-basic plugin (or another) can make this happen:
>>>
>>>
>>> //index responseTime for each url if http.store.responsetime is true
>>>     if (conf.getBoolean("http.store.responsetime", true)) {
>>>       Writable value = datum.getMetaData().get(new Text("_rs_"));
>>>       if (value != null)
>>>         doc.add("responseTime", value.toString());
>>>     }
>>>
>>> I can do the Jira ticket and patch for this.
>>> What do you think about it?
>>
>> //End of email,text below is autogenerated by my email server.
> 
> 
