Hi, I'm not sure what the problem was, but I made some changes to MetaTagsIndexer.java, adding new doc.add() calls. I have also updated schema.xml and nutch-site.xml, so perhaps there was a mismatch between field names.
I had to do this to enable custom fields to be indexed. It would be great if this could be done in configuration only.

Claus Daldorph Nielsen
Theilgaard Mortensen a/s

Julien Nioche <[email protected]>
25-05-2010 11:45
Please respond to [email protected]
To: [email protected]
cc:
Subject: Re: Parse and index meta tags in Nutch 1.0

Hi Claus,

Glad you got it to work. Do you know what the problem was?
BTW you can vote for issues you like in Jira - if enough people find this plugin useful I'll commit it to the trunk

J.

On 25 May 2010 08:57, Claus Daldorph Nielsen <[email protected]> wrote:

> Julien,
>
> Thank you so much I really appreciate your help. I have now managed to get
> Nutch to index meta tags in my Solr index (I am using Luke to verify that
> the correct content is in my index). Only thing left now is to find out
> how to search and get content from the new fields in Solr.
>
> Claus Daldorph Nielsen
> Theilgaard Mortensen a/s
>
> Julien Nioche <[email protected]>
> 21-05-2010 17:18
> Please respond to [email protected]
> To: [email protected]
> cc:
> Subject: Re: Parse and index meta tags in Nutch 1.0
>
> You can :
> - run *bin/nutch org.apache.nutch.parse.ParserChecker* and check that you
>   are getting metatag.* in the parse-metadata
> - check in the log that the parse-metatags is really loaded
> - run 'ant test-plugins' and see the output in build/parse-metatags
> - check that you've added the field definitions in the SOLR schema
> - index with Lucene and use Luke to check that the fields are created
>
> On 21 May 2010 15:54, Claus Daldorph Nielsen <[email protected]> wrote:
>
> > I never got this to work. So if anybody have some ideas for debugging
> > then please post your ideas.
> >
> > The problem is that the meta tags are never found or added to the Solr
> > index. I have no idea why.
> >
> > Claus Daldorph Nielsen
> > Theilgaard Mortensen a/s
> > Niels Hemmingsens gade 9
> > 1153 København K
> > Tlf: 33448555
> >
> > Julien Nioche <[email protected]>
> > 21-05-2010 13:33
> > Please respond to [email protected]
> > To: [email protected]
> > cc:
> > Subject: Re: Parse and index meta tags in Nutch 1.0
> >
> > Have you checked the discussion in
> > http://lucene.472066.n3.nabble.com/description-and-keywords-td690681.html?
> > What have you modified in nutch-site.xml?
> >
> > j.
> >
> > On 21 May 2010 12:15, Claus Daldorph Nielsen <[email protected]> wrote:
> >
> > > Julien,
> > >
> > > Thanks it looks much like what I need. I have applied the patch and
> > > added the lines to nutch-site.xml and then rebuild the Nutch project.
> > > But still I don't see any metatags in my index. Do you have any
> > > suggestions to what I might be doing wrong? Perhaps some configuration
> > > that I missed?
> > >
> > > Claus Daldorph Nielsen
> > > Theilgaard Mortensen a/s
> > > Niels Hemmingsens gade 9
> > > 1153 København K
> > > Tlf: 33448555
> > >
> > > Julien Nioche <[email protected]>
> > > 21-05-2010 09:39
> > > Please respond to [email protected]
> > > To: [email protected]
> > > cc:
> > > Subject: Re: Parse and index meta tags in Nutch 1.0
> > >
> > > Claus,
> > >
> > > See https://issues.apache.org/jira/browse/NUTCH-809 and a related
> > > discussion on
> > > http://lucene.472066.n3.nabble.com/description-and-keywords-td690681.html
> > >
> > > Julien
> > >
> > > --
> > > DigitalPebble Ltd
> > > http://www.digitalpebble.com
> > >
> > > On 21 May 2010 08:26, Claus Daldorph Nielsen <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am new to Nutch and trying to get Nutch to index meta tags from
> > > > html pages and store them for searching in Solr. The tags are on
> > > > this form:
> > > >
> > > > <meta name="TITLE" content="Some title" />
> > > > <meta name="KEYWORDS" content="Forum, help, build, stuff" />
> > > >
> > > > I would like to store the tags as two different fields in the index.
> > > > I have tried the example explaining how to create a plugin but the
> > > > example is for Nutch 0.9 and only helps me getting started.
> > > >
> > > > I think that I should look at :
> > > >
> > > > $NUTCH_HOME/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> > > >
> > > > and find the line:
> > > > HTMLMetaProcessor.getMetaTags(metaTags, root, base);
> > > >
> > > > But I'm not sure how to go on from here. Any help would be
> > > > appreciated and you are welcome to inform me if you know of an
> > > > existing plugin that will index the meta tags.
> > > >
> > > > Claus Daldorph Nielsen
> > > > Theilgaard Mortensen a/s
> >
> > --
> > DigitalPebble Ltd
> > http://www.digitalpebble.com
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com

--
DigitalPebble Ltd
http://www.digitalpebble.com
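
For anyone following the thread, here is a minimal, Nutch-independent sketch of the mapping discussed above: turning the two meta tags from the original question into separate, named fields. The class MetaTagFieldSketch and the method toFields are hypothetical names made up for illustration; they are not part of Nutch or of the NUTCH-809 patch. The "metatag." prefix follows the metatag.* keys mentioned for the parse metadata, but using the same names for the index fields is an assumption. The regex also assumes double-quoted attributes with name before content, which a real HTML parser would not require.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not Nutch code): maps meta tags to field names. */
public class MetaTagFieldSketch {

    // Assumption: attributes are double-quoted and "name" comes before "content".
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"\\s*/?>",
        Pattern.CASE_INSENSITIVE);

    /** Returns field name -> value, e.g. "metatag.title" -> "Some title". */
    public static Map<String, String> toFields(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            // Lower-case the tag name so TITLE and title map to the same field.
            String field = "metatag." + m.group(1).toLowerCase(Locale.ROOT);
            fields.put(field, m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        String html = "<meta name=\"TITLE\" content=\"Some title\" />\n"
                    + "<meta name=\"KEYWORDS\" content=\"Forum, help, build, stuff\" />";
        // Prints one "field = value" line per meta tag found.
        toFields(html).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Whatever field names end up being used, the same names would presumably have to appear in the indexer's doc.add calls and in the field definitions in schema.xml, which is the kind of mismatch suspected at the top of this message.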

