Hi,

> I have looked at the index-metadata plugin and I think the problem is that
> the Writable object is forcibly cast to Text; see
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58

Good catch! The metadata value must implement Writable but need not be an
instance of Text. The actual class doesn't matter; the toString() method
should return something meaningful.

Please open a Jira issue to fix this problem in index-metadata.

@Eyeris: it's up to you whether to open a separate issue to make the
response time a "built-in" index field. We could add it to index-more, which
already provides fields related to crawling: last-modified, MIME type,
content length.

Best,
Sebastian

On 02/07/2017 12:08 AM, Eyeris Rodriguez Rueda wrote:
> Hello Markus.
> I have tried your recommendation using
>
> <property>
>   <name>index.db.md</name>
>   <value>_rs_</value>
>   <description>
>     Comma-separated list of keys to be taken from the crawldb metadata to
>     generate fields.
>     Can be used to index values propagated from the seeds with the plugin
>     urlmeta
>   </description>
> </property>
>
> but I get the exception below from the indexer.
> ******************************************************************
> 2017-02-06 18:18:28,905 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: on
> 2017-02-06 18:18:29,024 INFO more.MoreIndexingFilter - Reading content type mappings from file contenttype-mapping.txt
> 2017-02-06 18:18:29,849 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: content dest: content
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: title dest: title
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: host dest: host
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: segment dest: segment
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: boost dest: boost
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: digest dest: digest
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: metatag.description dest: description
> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: metatag.keywords dest: keywords
> 2017-02-06 18:18:30,134 WARN mapred.LocalJobRunner - job_local15168888_0001
> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>   at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>   at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>   at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>   at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>   at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>   at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
> ******************************************************************
>
> I have looked at the index-metadata plugin and I think the problem is that
> the Writable object is forcibly cast to Text; see
> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
>
> **********************************
> // add the fields from crawldb
> if (dbFieldnames != null) {
>   for (String metatag : dbFieldnames) {
>     Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
>     if (metadata != null)
>       doc.add(metatag, metadata.toString());
>   }
> }
> ***************************************
> Line 58 needs to be changed to this:
> Text metadata = (Text) datum.getMetaData().get(new Text(metatag)).toString();
>
> If you agree, I can create the Jira ticket and a patch for this.
>
> ----- Original message -----
> From: "Markus Jelsma" <[email protected]>
> To: [email protected]
> Sent: Monday, 6 February 2017 17:54:39
> Subject: [MASSMAIL]RE: make responseTime native in nutch
>
> Try this:
>
> <property>
>   <name>index.db.md</name>
>   <value></value>
>   <description>
>     Comma-separated list of keys to be taken from the crawldb metadata to
>     generate fields.
>     Can be used to index values propagated from the seeds with the plugin
>     urlmeta
>   </description>
> </property>
>
> And enable the index-metadata (iirc) plugin, and you are good to go!
>
> Cheers,
> Markus
>
> -----Original message-----
>> From: Eyeris Rodriguez Rueda <[email protected]>
>> Sent: Monday 6th February 2017 15:56
>> To: [email protected]
>> Subject: make responseTime native in nutch
>>
>> Hi all.
>> Nutch has a configuration option that saves the response time for every
>> URL that is fetched. This value is stored in the CrawlDatum under the key
>> _rs_ but is not indexed.
>> It would be very useful to index this value as well.
>> This value is important in all cases, and it would be very easy to make
>> this native in Nutch.
>> A small change to the index-basic plugin (or another one) can make this
>> happen.
>>
>> // index responseTime for each url if http.store.responsetime is true
>> boolean property = conf.getBoolean("http.store.responsetime", true);
>> if (property == true) {
>>   String value = datum.getMetaData().get(new Text("_rs_")).toString();
>>   doc.add("responseTime", value);
>> }
>>
>> I can create the Jira ticket and a patch for this.
>> What do you think about it?
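
For readers following the thread, here is a minimal sketch of the idea discussed above: keep the crawldb metadata value typed as a Writable and call toString() on it, instead of casting it to Text. This is not the actual Nutch patch. To keep the example runnable without Hadoop on the classpath, Writable, Text and IntWritable below are small stand-ins for the org.apache.hadoop.io classes, the metadata map uses plain String keys rather than Text keys, and the names MetadataValueSketch, viaCast and viaToString are made up for illustration. Note that the replacement line quoted above, which casts the result of toString() to Text, would not compile as written, because toString() returns a String; the sketch therefore assigns to a Writable and converts afterwards.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only. Writable, Text and IntWritable are tiny stand-ins for the
 * org.apache.hadoop.io classes (an assumption made so this runs without Hadoop).
 */
final class MetadataValueSketch {

  /** Stand-in for org.apache.hadoop.io.Writable (marker only). */
  interface Writable { }

  /** Stand-in for org.apache.hadoop.io.Text. */
  static final class Text implements Writable {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
  }

  /** Stand-in for org.apache.hadoop.io.IntWritable. */
  static final class IntWritable implements Writable {
    private final int value;
    IntWritable(int value) { this.value = value; }
    @Override public String toString() { return Integer.toString(value); }
  }

  /** Mirrors the failing pattern: throws ClassCastException unless the value is a Text. */
  static String viaCast(Map<String, Writable> metadata, String key) {
    Text value = (Text) metadata.get(key);
    return value == null ? null : value.toString();
  }

  /** Works for any Writable: no cast, just toString(). Returns null if the key is absent. */
  static String viaToString(Map<String, Writable> metadata, String key) {
    Writable value = metadata.get(key);
    return value == null ? null : value.toString();
  }

  public static void main(String[] args) {
    Map<String, Writable> metadata = new HashMap<>();
    metadata.put("_rs_", new IntWritable(123)); // response time stored as an int
    System.out.println(viaToString(metadata, "_rs_"));
  }
}
```

The null check is kept in both variants because a URL may have no value stored under the requested key, in which case nothing should be added to the document.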

