Hi,

> https://issues.apache.org/jira/browse/NUTCH-2357

thanks!

> There are some plugins that add fields to the doc that will be indexed, but I
> can't find one place that describes every field that Nutch sends to the index.

conf/schema.xml
 - should list all fields filled by core or any of the indexing filter plugins

Yes, the names are hard-coded: a change to the name of an index field must be
applied to both the schema and the Java code.

Ideally, the fields are also listed in the wiki
  https://wiki.apache.org/nutch/IndexStructure
(but currently, some plugins and their fields are not listed there).

> I think that before adding a field to a doc, Nutch should check whether that
> field is present in schema.xml or not.

Both the IndexingFilters which add fields and the indexer plugins (interface
IndexWriter) are plugins and should work independently. A possible improvement
could be to add methods which let
 - indexing filters announce the fields they fill, and
 - index writers list the required or optionally accepted fields.
The indexing job could then check in advance for undeclared index fields.

Best,
Sebastian

On 02/07/2017 03:22 PM, Eyeris Rodriguez Rueda wrote:
> Thanks Sebastian.
> I have opened a ticket for the problem in index-metadata.
> This is the URL:
> https://issues.apache.org/jira/browse/NUTCH-2357
>
> About the responseTime in index-more, it looks great.
> One new field is also needed in the indexer (Solr, Elastic); if it is
> missing, Nutch will throw an exception.
>
> One more thing.
> There are some plugins that add fields to the doc that will be indexed, but I
> can't find one place that describes every field that Nutch sends to the index.
> index-basic sends (domain, host, url, content, title, cache, tstamp).
> index-more sends (type, date, contentLength)
> and others.
> I have looked into the Nutch code, but I think that Nutch doesn't use schema.xml.
> Is there any way to know all the fields that Nutch sends to the indexer (Solr or
> other)?
> I mean, apart from looking at the code of every index-* plugin, of course.
> If I delete one of these fields in Solr, then Nutch throws an exception,
> because every field name is hard-coded, for example
>
> doc.add("host", host);
> I think that before adding a field to a doc, Nutch should check whether that
> field is present in schema.xml or not.
>
> ----- Original message -----
> From: "Sebastian Nagel" <[email protected]>
> To: [email protected]
> Sent: Tuesday, 7 February 2017 4:28:03
> Subject: Re: [MASSMAIL]RE: make responseTime native in nutch
>
> Hi,
>
>> I have looked at the index-metadata plugin and I think that the problem is
>> that the Writable object is forced to Text, see
>> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
>
> Good catch! The metadata value must implement Writable but does not
> necessarily have to be an instance of Text. The actual class doesn't matter;
> the toString() method should return something meaningful.
> Please open a Jira issue to fix this problem in index-metadata.
>
> @Eyeris: it's up to you whether to open a separate issue to make the response
> time a "built-in" index field. We could add it to index-more, which already
> provides fields related to the crawling: last-modified, MIME type, content
> length.
>
> Best,
> Sebastian
>
> On 02/07/2017 12:08 AM, Eyeris Rodriguez Rueda wrote:
>> Hello Markus.
>> I have tried your recommendation using
>> <property>
>>   <name>index.db.md</name>
>>   <value>_rs_</value>
>>   <description>
>>   Comma-separated list of keys to be taken from the crawldb metadata to
>>   generate fields.
>>   Can be used to index values propagated from the seeds with the plugin
>>   urlmeta
>>   </description>
>> </property>
>>
>> but I get an exception (see below) from the indexer.
>> ******************************************************************
>> 2017-02-06 18:18:28,905 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: on
>> 2017-02-06 18:18:29,024 INFO more.MoreIndexingFilter - Reading content type mappings from file contenttype-mapping.txt
>> 2017-02-06 18:18:29,849 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: content dest: content
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: title dest: title
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: host dest: host
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: segment dest: segment
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: boost dest: boost
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: digest dest: digest
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: metatag.description dest: description
>> 2017-02-06 18:18:29,969 INFO solr.SolrMappingReader - source: metatag.keywords dest: keywords
>> 2017-02-06 18:18:30,134 WARN mapred.LocalJobRunner - job_local15168888_0001
>> java.lang.Exception: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
>> Caused by: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.Text
>>     at org.apache.nutch.indexer.metadata.MetadataIndexer.filter(MetadataIndexer.java:58)
>>     at org.apache.nutch.indexer.IndexingFilters.filter(IndexingFilters.java:51)
>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:330)
>>     at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:56)
>>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:444)
>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> 2017-02-06 18:18:30,777 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
>>     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
>>     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>>     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
>> ******************************************************************
>>
>> I have looked at the index-metadata plugin and I think that the problem is
>> that the Writable object is forced to Text, see
>> https://github.com/apache/nutch/blob/master/src/plugin/index-metadata/src/java/org/apache/nutch/indexer/metadata/MetadataIndexer.java#L58
>>
>> **********************************
>> // add the fields from crawldb
>> if (dbFieldnames != null) {
>>   for (String metatag : dbFieldnames) {
>>     Text metadata = (Text) datum.getMetaData().get(new Text(metatag));
>>     if (metadata != null)
>>       doc.add(metatag, metadata.toString());
>>   }
>> }
>> ***************************************
>> Line 58 needs to be changed to this:
>> Text metadata = (Text) datum.getMetaData().get(new Text(metatag)).toString();
>>
>> If you agree, I can create the Jira ticket and a patch for this.
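
As written, the one-line change proposed above would not compile, because
toString() returns a String, which can be neither cast nor assigned to Text.
A minimal sketch of the direction Sebastian describes in his reply (treat the
value as any Writable and rely on toString()) follows. So that it runs without
Hadoop on the classpath, Writable, Text and IntWritable are tiny local
stand-ins for the Hadoop classes of the same names, and MetadataToStringSketch
and fieldValue are hypothetical names, not part of Nutch.

```java
import java.util.HashMap;
import java.util.Map;

// Local stand-ins (assumptions) for org.apache.hadoop.io.Writable, Text and IntWritable.
interface Writable {}

final class Text implements Writable {
    private final String value;
    Text(String value) { this.value = value; }
    @Override public String toString() { return value; }
    @Override public boolean equals(Object o) {
        return o instanceof Text && ((Text) o).value.equals(value);
    }
    @Override public int hashCode() { return value.hashCode(); }
}

final class IntWritable implements Writable {
    private final int value;
    IntWritable(int value) { this.value = value; }
    @Override public String toString() { return Integer.toString(value); }
}

// Hypothetical helper illustrating the cast-free lookup.
final class MetadataToStringSketch {
    /** Returns the metadata value for the key as a string, or null if the key is absent. */
    static String fieldValue(Map<Writable, Writable> metaData, String key) {
        // No cast to Text: any Writable is accepted and converted via toString().
        Writable metadata = metaData.get(new Text(key));
        return metadata == null ? null : metadata.toString();
    }

    public static void main(String[] args) {
        Map<Writable, Writable> metaData = new HashMap<>();
        metaData.put(new Text("_rs_"), new IntWritable(142)); // not a Text value
        metaData.put(new Text("lang"), new Text("en"));
        System.out.println(fieldValue(metaData, "_rs_"));
        System.out.println(fieldValue(metaData, "lang"));
        System.out.println(fieldValue(metaData, "missing"));
    }
}
```

In the plugin itself, the same pattern would replace the cast on line 58: fetch
the value as a Writable, skip it when it is null, and pass toString() to
doc.add().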
>>
>> ----- Original message -----
>> From: "Markus Jelsma" <[email protected]>
>> To: [email protected]
>> Sent: Monday, 6 February 2017 17:54:39
>> Subject: [MASSMAIL]RE: make responseTime native in nutch
>>
>> Try this:
>>
>> <property>
>>   <name>index.db.md</name>
>>   <value></value>
>>   <description>
>>   Comma-separated list of keys to be taken from the crawldb metadata to
>>   generate fields.
>>   Can be used to index values propagated from the seeds with the plugin
>>   urlmeta
>>   </description>
>> </property>
>>
>> And enable index-metadata (iirc) plugin, you are good to go!
>>
>> Cheers,
>> Markus
>>
>> -----Original message-----
>>> From:Eyeris Rodriguez Rueda <[email protected]>
>>> Sent: Monday 6th February 2017 15:56
>>> To: [email protected]
>>> Subject: make responseTime native in nutch
>>>
>>> Hi all.
>>> Nutch has a configuration option that allows saving the responseTime for
>>> every URL that is fetched; this value is stored in the CrawlDatum under the
>>> key _rs_ but is not indexed.
>>> It would be very useful to index this value as well.
>>> This value is very important in all cases and it is very easy to make this
>>> native in Nutch.
>>> A small change to the index-basic plugin (or another one) can make this happen.
>>>
>>> // index responseTime for each url if http.store.responsetime is true
>>> boolean property = conf.getBoolean("http.store.responsetime", true);
>>> if (property) {
>>>   // the key is absent if the response time was not stored for this url
>>>   Writable value = datum.getMetaData().get(new Text("_rs_"));
>>>   if (value != null)
>>>     doc.add("responseTime", value.toString());
>>> }
>>>
>>> I can create the Jira ticket and a patch for this.
>>> What do you think about it?
>>
>> // End of email, the text below is autogenerated by my email server.
>
> The @universidad_uci is Fidel. We young people will not fail.
> #HastaSiempreComandante
> #HastalaVictoriaSiempre
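
Sebastian's suggestion at the top of this thread (indexing filters announce the
fields they fill, index writers list the fields they accept, and the indexing
job checks for undeclared fields before it starts) could look roughly like the
sketch below. Nothing here is existing Nutch API: FieldAnnouncingFilter,
FieldAwareWriter, getFilledFields, getAcceptedFields and IndexFieldCheck are
hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical: an indexing filter that can announce the fields it fills.
interface FieldAnnouncingFilter {
    Set<String> getFilledFields();
}

// Hypothetical: an index writer that can list the fields it accepts.
interface FieldAwareWriter {
    Set<String> getAcceptedFields();
}

final class IndexFieldCheck {
    /** Returns, in sorted order, every announced field that the writer does not accept. */
    static Set<String> undeclaredFields(Collection<? extends FieldAnnouncingFilter> filters,
                                        FieldAwareWriter writer) {
        Set<String> undeclared = new TreeSet<>();
        for (FieldAnnouncingFilter filter : filters) {
            undeclared.addAll(filter.getFilledFields());
        }
        undeclared.removeAll(writer.getAcceptedFields());
        return undeclared;
    }

    public static void main(String[] args) {
        // Field names taken from the thread; the filters themselves are made up.
        FieldAnnouncingFilter basic = () -> new HashSet<>(Arrays.asList("host", "url", "title"));
        FieldAnnouncingFilter more = () -> new HashSet<>(Arrays.asList("type", "responseTime"));
        FieldAwareWriter writer = () -> new HashSet<>(Arrays.asList("host", "url", "title", "type"));

        List<FieldAnnouncingFilter> filters = Arrays.asList(basic, more);
        Set<String> undeclared = undeclaredFields(filters, writer);
        if (!undeclared.isEmpty()) {
            // A real job could fail fast here instead of failing in the middle of indexing.
            System.out.println("Undeclared index fields: " + undeclared);
        }
    }
}
```

Keeping the check in the indexing job, rather than in the filters or writers,
preserves the independence of the two plugin types that Sebastian points out:
neither side needs to know about the other, only about its own fields.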

