Hi all, I am having a few problems using the urlmeta plugin as described on the plugin wiki.
What I have tried so far:

- Followed the wiki page to use the urlmeta plugin with a <meta> tag in one of my web pages. It was not indexed as I expected, and the meta tag did not even show up in the readseg dumps.
- Then I tried giving the meta tags along with the URLs in the seed file (tab separated). The meta tags showed up in the dump, but querying Solr gave no results.
- Then I applied the NUTCH-809 patch, rebuilt Nutch, etc. to see if that works. Same result as the first case: not in the metadata field (readseg dump) and not in the Solr results.
- Checking the index with Luke showed no field named "keywords" (my meta tag name).

So what could be the issue here?

Also, I want to know the difference between the urlmeta and index-metatags plugins and their exact uses. I am a bit confused, because the urlmeta wiki talks about adding a <meta> tag to your HTML page, yet the tag is not indexed, and there is another plugin, index-metatags, for the same purpose.

I am a newbie to this, please help.

On Thu, Jan 12, 2012 at 5:20 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Elisabeth please see my comments on issue.
>
> Thanks again
>
> Lewis
>
> On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler <[email protected]> wrote:
>
> > Hi Dean,
> > I added my documentation and bundled plugin to jira
> > (https://issues.apache.org/jira/browse/NUTCH-809),
> > hope this helps.
> >
> > On 11.01.2012 22:44, Dean Del Ponte wrote:
> >
> >> Thank-you for your response.
> >>
> >> My goal is to get Nutch to index meta tags. It's been quite an adventure so far!
> >>
> >> On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <[email protected]> wrote:
> >>
> >>> Hi Dean,
> >>>
> >>> Unfortunately nothing official. If you look you will see that this plugin
> >>> (if eventually integrated), will combine with two other issues which all
> >>> revolve roughly around the same area.
> >>>
> >>> I have never used this patch or any of the others.
> >>>
> >>> Anyone else?
> >>>
> >>> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
> >>>
> >>>> Any documentation on how to use the patch at
> >>>> https://issues.apache.org/jira/browse/NUTCH-809 ?
> >>>>
> >>>> My apologies for the newbie question.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dean Del Ponte
> >>>>
> >>>> On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <[email protected]> wrote:
> >>>>
> >>>>> Hi Matt,
> >>>>>
> >>>>> The plugin urlmeta does NOT extract the metadata from HTML pages. The
> >>>>> 'meta' in its name means 'crawldb metadata'.
> >>>>>
> >>>>> You need to use the patch in
> >>>>> https://issues.apache.org/jira/browse/NUTCH-809
> >>>>>
> >>>>> HTH
> >>>>>
> >>>>> Julien
> >>>>>
> >>>>> On 26 September 2011 21:18, Wilson, Matt <[email protected]> wrote:
> >>>>>
> >>>>>> Also,
> >>>>>>
> >>>>>> In case this helps. I removed the Keywords field from the solr schema to
> >>>>>> see if it would generate an error when the SolrIndexer runs and it does not.
> >>>>>>
> >>>>>> This has led me to believe that nutch is either not indexing the meta
> >>>>>> content or it is not sending the update to solr when SolrIndexer runs.
> >>>>>>
> >>>>>> Matt Wilson
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: lewis john mcgibbney [mailto:[email protected]]
> >>>>>> Sent: Monday, September 26, 2011 3:04 PM
> >>>>>> To: [email protected]
> >>>>>> Subject: Re: Indexing specific metadata tags with urlmeta
> >>>>>>
> >>>>>> Hi Matt,
> >>>>>>
> >>>>>> Try changing
> >>>>>>
> >>>>>> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
> >>>>>>
> >>>>>> to
> >>>>>>
> >>>>>> <field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>
> >>>>>>
> >>>>>> as per your metadata tags.
> >>>>>>
> >>>>>> We also have a configuration option in nutch-site.xml which you could
> >>>>>> check out.
> >>>>>>
> >>>>>> <property>
> >>>>>>   <name>urlmeta.tags</name>
> >>>>>>   <value></value>
> >>>>>>   <description>
> >>>>>>     To be used in conjunction with features introduced in NUTCH-655, which allows
> >>>>>>     for custom metatags to be injected alongside your crawl URLs. Specifying those
> >>>>>>     custom tags here will allow for their propagation into a page's outlinks, as
> >>>>>>     well as allow for them to be included as part of an index.
> >>>>>>     Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
> >>>>>>     white-space at their boundaries, if you are using anything earlier than
> >>>>>>     Hadoop-0.21.
> >>>>>>   </description>
> >>>>>> </property>
> >>>>>>
> >>>>>> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:
> >>>>>>
> >>>>>>> I am attempting to crawl a corporate intranet site and allow it to be
> >>>>>>> searched in solr.
> >>>>>>> As part of the requirements I have to be able to index
> >>>>>>> certain metadata tags as their own field in solr (for faceted search). For
> >>>>>>> example, the pages being crawled contain the following meta tag:
> >>>>>>>
> >>>>>>> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />
> >>>>>>>
> >>>>>>> I have updated the nutch-site.xml with the following:
> >>>>>>>
> >>>>>>> <property>
> >>>>>>>   <name>plugin.includes</name>
> >>>>>>>   <value>urlmeta|protocol-httpclient|...</value>
> >>>>>>> </property>
> >>>>>>> <property>
> >>>>>>>   <name>urlmeta.tags</name>
> >>>>>>>   <value>keywords</value>
> >>>>>>> </property>
> >>>>>>>
> >>>>>>> I have updated the solr schema.xml with the following addition:
> >>>>>>>
> >>>>>>> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
> >>>>>>>
> >>>>>>> I can see that the field has been created in Solr via the admin interface.
> >>>>>>> I also see that nutch is loading the urlmeta plugin and adding the
> >>>>>>> indexfilters etc. in the hadoop.log. The problem is that nutch does not
> >>>>>>> appear to be indexing the keywords field. All of the pages crawled have the
> >>>>>>> tag present and I am receiving no errors in the nutch log. I am unsure as
> >>>>>>> to what I am missing. This seems to be pretty straightforward; however, I
> >>>>>>> must be misunderstanding either the urlmeta plugin or missing something in
> >>>>>>> the configuration.
> >>>>>>
> >>>>>> --
> >>>>>> *Lewis*
> >>>>>>
> >>>>>> This E-Mail has been scanned for viruses.
> >>>>>
> >>>>> --
> >>>>> Open Source Solutions for Text Engineering
> >>>>>
> >>>>> http://digitalpebble.blogspot.com/
> >>>>> http://www.digitalpebble.com
> >>>
> >>> --
> >>> *Lewis*
>
> --
> *Lewis*

--
Thanks & Regards
Vijith V
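
For reference, the seed-file approach mentioned at the top of this message (meta tags given alongside the URLs, tab separated) and the quoted advice about not padding urlmeta.tags with whitespace can be sketched as below. This is only a minimal illustration, not part of Nutch: the helper names seed_line and urlmeta_tags_value are hypothetical, the URL is a placeholder, and the sketch assumes the injector reads each metadata entry as a tab-separated key=value pair.

```python
def seed_line(url, metadata):
    """Build one seed-file line: the URL, then tab-separated key=value pairs.

    Hypothetical helper; assumes the injector expects 'url<TAB>key=value...'.
    """
    for key, value in metadata.items():
        # Tabs separate fields and '=' separates key from value,
        # so neither may appear where it would be ambiguous.
        if "\t" in key or "\t" in value or "=" in key:
            raise ValueError(f"unsafe metadata entry: {key!r}={value!r}")
    fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
    return "\t".join(fields)


def urlmeta_tags_value(tags):
    """Build the urlmeta.tags value: comma-delimited, no whitespace padding."""
    return ",".join(tag.strip() for tag in tags)


if __name__ == "__main__":
    # Placeholder URL and values, for illustration only.
    print(seed_line("http://www.example.com/", {"keywords": "banking"}))
    print(urlmeta_tags_value(["keywords", " author "]))
```

The tag names passed to urlmeta_tags_value would need to match the keys used in the seed file, including case, in the same way the Solr field name has to match in the quoted schema advice.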

