Hi,

> I have looked at the index-metadata plugin and I think that the problem is
> when the Writable object is cast to Text, see
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58

Good catch! The metadata value must implement Writable but is not necessarily
an instance of Text. The actual class doesn't matter; its toString() method
should return something meaningful.
Please open a Jira issue to fix this problem in index-metadata.
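
For illustration, here is a minimal sketch of how the loop in
MetadataIndexer.filter() could look if the value is read as a generic
Writable and only its toString() is used (the surrounding loop is the
current code, the exact change is just a suggestion, not a tested patch):

  // suggestion only: accept any Writable value from the crawldb metadata
  // (Text, IntWritable, ...) and index its toString() representation;
  // needs an import of org.apache.hadoop.io.Writable
  if (dbFieldnames != null) {
    for (String metatag : dbFieldnames) {
      Writable metadata = datum.getMetaData().get(new Text(metatag));
      if (metadata != null) {
        doc.add(metatag, metadata.toString());
      }
    }
  }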

@Eyeris: it's up to you whether to open a separate issue to make the response
time a "built-in" index field. We could add it to index-more, which already
provides fields related to the crawl: last-modified, MIME type, content
length.
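
If we go that way, the switch could follow the pattern of the existing
http.store.responsetime property. A hypothetical nutch-default.xml entry
(property and field names are only placeholders to illustrate the idea,
not an agreed design):

<property>
  <name>index.responsetime</name>
  <value>false</value>
  <description>
    Placeholder sketch: if true, index-more would add the fetch response
    time stored in the CrawlDatum metadata (key _rs_) as a "responseTime"
    field. Requires http.store.responsetime to be enabled during fetching.
  </description>
</property>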

Best,
Sebastian

On 02/07/2017 12:08 AM, Eyeris Rodriguez Rueda wrote:
> Hello Markus.
> I have tried your recommendation using:
> <property>
>   <name>index.db.md</name>
>   <value>_rs_</value>
>   <description>
>      Comma-separated list of keys to be taken from the crawldb metadata to 
> generate fields.
>      Can be used to index values propagated from the seeds with the plugin 
> urlmeta 
>   </description>
> </property>
> 
> 
> but I get the exception below from the indexer.
> ******************************************************************
> 2017-02-06 18:18:28,905 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: on
> 2017-02-06 18:18:29,024 INFO  more.MoreIndexingFilter - Reading content type mappings from file contenttype-mapping.txt
> 2017-02-06 18:18:29,849 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: content dest: content
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: title dest: title
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: host dest: host
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: segment dest: segment
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: boost dest: boost
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: digest dest: digest
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.description dest: description
> 2017-02-06 18:18:29,969 INFO  solr.SolrMappingReader - source: metatag.keywords dest: keywords
> 2017-02-06 18:18:30,134 WARN  mapred.LocalJobRunner - job_local15168888_0001
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>       at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>       at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>       at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>       at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>       at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>       at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>       at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>       at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>       at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
> ******************************************************************
> 
> I have looked at the index-metadata plugin and I think that the problem is
> when the Writable object is cast to Text, see
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
> 
> **********************************
> // add the fields from crawldb
>     if (dbFieldnames != null) {
>       for (String metatag : dbFieldnames) {
>         Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
>         if (metadata != null)
>           doc.add(metatag, metadata.toString());
>       }
>     }
> ***************************************
> Line 58 needs to be changed so that the value is not cast to Text, e.g.:
> Writable metadata = datum.getMetaData().get(new Text(metatag));
> The value's toString() is then used when the field is added.
> 
> If you agree, I can create the Jira ticket and a patch for this.
> 
> ----- Original Message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Monday, February 6, 2017 17:54:39
> Subject: [MASSMAIL]RE: make responseTime native in nutch
> 
> Try this:
> 
> <property>
>   <name>index.db.md</name>
>   <value></value>
>   <description>
>      Comma-separated list of keys to be taken from the crawldb metadata to 
> generate fields.
>      Can be used to index values propagated from the seeds with the plugin 
> urlmeta 
>   </description>
> </property>
> 
> And enable the index-metadata plugin (iirc), and you are good to go!
> 
> Cheers,
> Markus
> 
>  
>  
> -----Original message-----
>> From:Eyeris Rodriguez Rueda <[email protected]>
>> Sent: Monday 6th February 2017 15:56
>> To: [email protected]
>> Subject: make responseTime native in nutch
>>
>> Hi all.
>> Nutch has a configuration that permits saving the responseTime for every
>> URL that is fetched; this value is stored in the CrawlDatum under the key
>> _rs_, but it is not indexed.
>> It would be very useful to index this value as well.
>> This value is important in many cases, and it is easy to make this native
>> in Nutch.
>> A small change to the index-basic plugin (or another one) can make this
>> happen.
>>
>>
>> // index the responseTime for each URL if http.store.responsetime is true
>>     if (conf.getBoolean("http.store.responsetime", true)) {
>>       Writable value = datum.getMetaData().get(new Text("_rs_"));
>>       if (value != null) {
>>         doc.add("responseTime", value.toString());
>>       }
>>     }
>>
>> I can create the Jira ticket and a patch for this.
>> What do you think about it?
> 
