Hi Elisabeth,

Please see my comments on the issue. Thanks again.

Lewis

On Thu, Jan 12, 2012 at 9:15 AM, Elisabeth Adler <[email protected]> wrote:
> Hi Dean,
> I added my documentation and bundled plugin to JIRA
> (https://issues.apache.org/jira/browse/NUTCH-809), hope this helps.
>
> On 11.01.2012 22:44, Dean Del Ponte wrote:
>> Thank you for your response.
>>
>> My goal is to get Nutch to index meta tags. It's been quite an adventure
>> so far!
>>
>> On Wed, Jan 11, 2012 at 3:30 PM, Lewis John Mcgibbney <[email protected]> wrote:
>>> Hi Dean,
>>>
>>> Unfortunately nothing official. If you look you will see that this plugin
>>> (if eventually integrated) will combine with two other issues which all
>>> revolve roughly around the same area.
>>>
>>> I have never used this patch or any of the others.
>>>
>>> Anyone else?
>>>
>>> On Wed, Jan 11, 2012 at 8:54 PM, Dean Del Ponte <[email protected]> wrote:
>>>> Any documentation on how to use the patch at
>>>> https://issues.apache.org/jira/browse/NUTCH-809 ?
>>>>
>>>> My apologies for the newbie question.
>>>>
>>>> Thanks,
>>>>
>>>> Dean Del Ponte
>>>>
>>>> On Mon, Sep 26, 2011 at 4:08 PM, Julien Nioche <[email protected]> wrote:
>>>>> Hi Matt,
>>>>>
>>>>> The plugin urlmeta does NOT extract the metadata from HTML pages. The
>>>>> 'meta' in its name means 'crawldb metadata'.
>>>>>
>>>>> You need to use the patch in
>>>>> https://issues.apache.org/jira/browse/NUTCH-809
>>>>>
>>>>> HTH
>>>>>
>>>>> Julien
>>>>>
>>>>> On 26 September 2011 21:18, Wilson, Matt <Matthew.Wilson@salliemae.com> wrote:
>>>>>> Also,
>>>>>>
>>>>>> In case this helps: I removed the Keywords field from the solr schema
>>>>>> to see if it would generate an error when the SolrIndexer runs, and it
>>>>>> does not.
>>>>>>
>>>>>> This has led me to believe that nutch is either not indexing the meta
>>>>>> content or it is not sending the update to solr when SolrIndexer runs.
>>>>>>
>>>>>> Matt Wilson
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: lewis john mcgibbney [mailto:lewis.mcgibbney@gmail.com]
>>>>>> Sent: Monday, September 26, 2011 3:04 PM
>>>>>> To: [email protected]
>>>>>> Subject: Re: Indexing specific metadata tags with urlmeta
>>>>>>
>>>>>> Hi Matt,
>>>>>>
>>>>>> Try changing
>>>>>>
>>>>>> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
>>>>>>
>>>>>> to
>>>>>>
>>>>>> <field name="Keywords" type="string" stored="true" indexed="true" multiValued="true"/>
>>>>>>
>>>>>> as per your metadata tags.
>>>>>>
>>>>>> We also have a configuration option in nutch-site.xml which you could
>>>>>> check out.
>>>>>>
>>>>>> <property>
>>>>>>   <name>urlmeta.tags</name>
>>>>>>   <value></value>
>>>>>>   <description>
>>>>>>     To be used in conjunction with features introduced in NUTCH-655, which allows
>>>>>>     for custom metatags to be injected alongside your crawl URLs. Specifying those
>>>>>>     custom tags here will allow for their propagation into a pages outlinks, as
>>>>>>     well as allow for them to be included as part of an index.
>>>>>>     Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
>>>>>>     white-space at their boundaries, if you are using anything earlier than
>>>>>>     Hadoop-0.21.
>>>>>>   </description>
>>>>>> </property>
>>>>>>
>>>>>> On Mon, Sep 26, 2011 at 7:07 PM, Wilson, Matt <[email protected]> wrote:
>>>>>>> I am attempting to crawl a corporate intranet site and allow it to be
>>>>>>> searched in solr. As part of the requirements I have to be able to
>>>>>>> index certain metadata tags as their own field in solr (for faceted
>>>>>>> search). For example, the pages being crawled contain the following
>>>>>>> meta tag:
>>>>>>>
>>>>>>> <meta id="ctl00_hdrKeywords" name="Keywords" content="Banking, Savings, Student Loans, CDs, Certificates of Deposit, Smart Option Loan, 529 Plans" />
>>>>>>>
>>>>>>> I have updated the nutch-site.xml with the following:
>>>>>>>
>>>>>>> <property>
>>>>>>>   <name>plugin.includes</name>
>>>>>>>   <value>urlmeta|protocol-httpclient|...</value>
>>>>>>> </property>
>>>>>>> <property>
>>>>>>>   <name>urlmeta.tags</name>
>>>>>>>   <value>keywords</value>
>>>>>>> </property>
>>>>>>>
>>>>>>> I have updated the solr schema.xml with the following addition:
>>>>>>>
>>>>>>> <field name="keywords" type="string" stored="true" indexed="true" multiValued="true"/>
>>>>>>>
>>>>>>> I can see that the field has been created in Solr via the admin
>>>>>>> interface. I also see that nutch is loading the urlmeta plugin and
>>>>>>> adding the indexfilters etc. in the hadoop.log. The problem is that
>>>>>>> nutch does not appear to be indexing the keywords field. All of the
>>>>>>> pages crawled have the tag present and I am receiving no errors in
>>>>>>> the nutch log. I am unsure as to what I am missing. This seems to be
>>>>>>> pretty straightforward; however, I must be misunderstanding either
>>>>>>> the urlmeta plugin or missing something in the configuration.
>>>>>>
>>>>>> --
>>>>>> *Lewis*
>>>>>
>>>>> --
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>
>>> --
>>> *Lewis*

--
*Lewis*

